Initial implementation by kimi k2 0905
This commit is contained in:
85
PLANNING/Task.md
Normal file
85
PLANNING/Task.md
Normal file
@@ -0,0 +1,85 @@
|
||||
**Time Allocation:** approx 2 hours total
|
||||
**AI Tools Encouraged:** Use any generative AI tools (ChatGPT, Claude, Copilot, etc.) to accelerate development
|
||||
**Deliverables:** Automation pipelines, API integrations, testing frameworks, and deployment configurations
|
||||
|
||||
**Contract Lifecycle Management (CLM) Automation**
|
||||
|
||||
The company is streamlining its Contract Lifecycle Management (CLM) process. Currently, contracts are stored in a disorganized manner across various departments. The goal is to create an intelligent platform that can:
|
||||
|
||||
- Index: Automatically ingest contracts from different sources.
|
||||
- Understand: Extract key information (dates, parties, clauses).
|
||||
- Alert: Identify potential issues (conflicts in contact info, approaching expiration dates).
|
||||
- Provide Access: Make contract information easily accessible to authorized users via a chatbot and daily reports.
|
||||
- Enable Insights: Detect similar contract versions (for version control and review).
|
||||
|
||||
The candidate should generate a synthetic dataset of 10-15 documents of varying types. Include these document formats:
|
||||
|
||||
- PDFs (4-5): Standard contracts, scanned contracts (requiring OCR)
|
||||
- Word Documents (.docx) (3-4): Draft contracts, amendments
|
||||
- Text Files (.txt) (2-3): Contract summaries, email correspondence related to contracts.
|
||||
- Unstructured Text (2): e.g. meeting notes regarding a contract. These should be purposefully less structured to test the candidate's ability to handle complexity.
|
||||
|
||||
Within the documents, there should be:
|
||||
|
||||
- Variations: Several versions of the same contract with minor changes.
|
||||
- Conflicts: Deliberately include conflicting information (e.g., different addresses for the same company, different expiration dates) across different documents.
|
||||
- Key Dates: Include contract creation dates, renewal dates, termination dates, and potentially clauses with specific effective dates.
|
||||
- Metadata: Some documents should have existing metadata (e.g., contract name, department) to test how the candidate integrates metadata into the pipeline. Others should not.
|
||||
|
||||
The candidate should build a system that can:
|
||||
|
||||
- Document Ingestion & Indexing:
|
||||
|
||||
- Load documents from a designated folder (simulating an incoming source).
|
||||
- Use a suitable vector database (e.g., ChromaDB, Pinecone) to store embeddings. Justify the database choice in a brief comment/readme.
|
||||
- Implement basic chunking strategy.
|
||||
|
||||
- RAG Pipeline:
|
||||
|
||||
- Create a RAG pipeline using Langchain or a similar framework. The pipeline should retrieve relevant document chunks based on user queries.
|
||||
|
||||
- AI Agent (Daily Report Generation):
|
||||
|
||||
- Develop an AI agent using Langchain Agents (or similar) that runs daily.
|
||||
- The agent should automatically:
|
||||
- Identify approaching contract expiration dates (within the next 30 days).
|
||||
- Detect conflicting information (e.g., different addresses for the same company in different contracts). The agent must describe what the conflict is and where it is (document names).
|
||||
- Summarize the findings in an email report to a predefined email address (provide a test email address). The email should be formatted clearly and concisely.
|
||||
|
||||
- Chatbot Interface:
|
||||
|
||||
- Create a simple chatbot interface (e.g., using Streamlit, Gradio, or a basic Flask app).
|
||||
- When a user asks a question about a contract, the chatbot should:
|
||||
- Use the RAG pipeline to retrieve relevant document chunks.
|
||||
- Provide the AI answer to the user.
|
||||
- Clearly cite the source documents used to generate the answer (e.g., document name and page number). This is crucial.
|
||||
|
||||
- Document Similarity:
|
||||
|
||||
- Implement a function to find similar documents based on semantic similarity (using embedding similarity). The user should be able to input a document name and receive a list of similar documents.
|
||||
|
||||
- Error Handling & Logging: Implement basic error handling and logging to ensure the system's reliability.
|
||||
|
||||
MCP Server Integration (Bonus): If the candidate has time, ask them to describe how they would integrate the RAG pipeline with an existing MCP server (e.g., using REST APIs). This doesn't necessarily require full implementation but demonstrating understanding of the process.
|
||||
|
||||
Success Criteria & Evaluation
|
||||
|
||||
- Functionality :
|
||||
|
||||
- Document Ingestion & Indexing: Does the system load and index the documents correctly? Are embeddings generated? (10%)
|
||||
- RAG Pipeline: Does the RAG pipeline retrieve relevant information based on user queries? (15%)
|
||||
- AI Agent: Does the agent run daily and generate accurate reports with detected conflicts and approaching expiration dates? (15%)
|
||||
- Chatbot Interface: Does the chatbot provide answers and cite sources correctly? (10%)
|
||||
|
||||
- Code Quality & Design:
|
||||
|
||||
- Readability: Is the code well-formatted and easy to understand?
|
||||
- Modularity: Is the code organized into logical modules?
|
||||
- Documentation: Is the code adequately documented?
|
||||
- Error Handling: Does the code handle errors gracefully?
|
||||
|
||||
- Reasoning & Approach:
|
||||
|
||||
- Framework Choices: Were appropriate frameworks and tools selected? Justification of choices is important.
|
||||
- Problem Solving: Did the candidate demonstrate a logical approach to solving the problem?
|
||||
- Scalability: Did the candidate consider the scalability of the solution? (e.g., vector database choice, chunking strategy)
|
||||
67
PLANNING/asyncio_event_loop_fix.md
Normal file
67
PLANNING/asyncio_event_loop_fix.md
Normal file
@@ -0,0 +1,67 @@
|
||||
# Asyncio Event Loop Issue with Streamlit
|
||||
|
||||
## Problem
|
||||
Error: `There is no current event loop in thread 'ScriptRunner.scriptThread'`
|
||||
|
||||
This occurs because:
|
||||
1. Google's langchain integration uses asyncio internally
|
||||
2. Streamlit runs scripts in a separate thread (ScriptRunner.scriptThread)
|
||||
3. This thread doesn't have an asyncio event loop by default
|
||||
4. When `embed_query()` is called, it tries to use async operations but fails
|
||||
|
||||
## Solution
|
||||
|
||||
### Option 1: Create Event Loop for Thread (Recommended)
|
||||
Add event loop handling in the model factory:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
import threading
|
||||
|
||||
def _create_google_embeddings() -> GoogleGenerativeAIEmbeddings:
|
||||
"""Create Google embeddings with validation"""
|
||||
if not config.GOOGLE_API_KEY:
|
||||
raise ValueError("GOOGLE_API_KEY not configured")
|
||||
|
||||
# Ensure event loop exists for current thread
|
||||
try:
|
||||
loop = asyncio.get_event_loop()
|
||||
except RuntimeError:
|
||||
# Create new event loop for this thread
|
||||
loop = asyncio.new_event_loop()
|
||||
asyncio.set_event_loop(loop)
|
||||
|
||||
embeddings = GoogleGenerativeAIEmbeddings(
|
||||
model=config.GOOGLE_EMBEDDING_MODEL,
|
||||
google_api_key=config.GOOGLE_API_KEY
|
||||
)
|
||||
|
||||
# Rest of validation...
|
||||
```
|
||||
|
||||
### Option 2: Use nest_asyncio (Simple but less clean)
|
||||
Install and apply nest_asyncio at app startup:
|
||||
|
||||
```python
|
||||
import nest_asyncio
|
||||
nest_asyncio.apply()
|
||||
```
|
||||
|
||||
### Option 3: Synchronous Wrapper
|
||||
Create a synchronous wrapper for async operations:
|
||||
|
||||
```python
|
||||
def sync_embed_query(embeddings, text):
|
||||
"""Synchronous wrapper for async embed_query"""
|
||||
try:
|
||||
loop = asyncio.get_event_loop()
|
||||
except RuntimeError:
|
||||
loop = asyncio.new_event_loop()
|
||||
asyncio.set_event_loop(loop)
|
||||
|
||||
return loop.run_until_complete(embeddings.aembed_query(text))
|
||||
```
|
||||
|
||||
## Recommended Fix
|
||||
|
||||
Update `model_factory.py` in the `_create_google_embeddings` method to handle the event loop properly.
|
||||
53
PLANNING/design.md
Normal file
53
PLANNING/design.md
Normal file
@@ -0,0 +1,53 @@
|
||||
# CLM System Architecture Design
|
||||
|
||||
## Design Patterns
|
||||
|
||||
### 1. Monolithic Architecture
|
||||
Single FastAPI application with modular components:
|
||||
- **Document Ingestion Module**: Handles multiple file formats (PDF, DOCX, TXT)
|
||||
- **RAG Module**: Manages vector embeddings and retrieval
|
||||
- **AI Agent Module**: Daily contract monitoring and reporting
|
||||
- **Chatbot Module**: User interface for contract queries
|
||||
|
||||
### 2. Direct File Operations
|
||||
- Simple utility functions for file I/O
|
||||
- Direct file system operations for document storage
|
||||
- No abstraction layer needed for this scope
|
||||
|
||||
### 3. Direct File Processing
|
||||
- Simple file type detection and processing functions
|
||||
- Direct embedding generation using selected model
|
||||
|
||||
### 4. Strategy Pattern
|
||||
- `ChunkingStrategy`: Basic fixed-size chunking
|
||||
- `EmbeddingModel`: Single model (OpenAI or local)
|
||||
|
||||
### 5. Chain of Responsibility
|
||||
- Document processing pipeline:
|
||||
1. `FileValidator` → 2. `OCRProcessor` → 3. `TextExtractor` → 4. `Chunker` → 5. `Embedder` → 6. `VectorStore`
|
||||
|
||||
### 6. Singleton Pattern
|
||||
- `ConfigurationManager`: Global config access
|
||||
- `VectorDatabaseConnection`: Single connection
|
||||
- `Logger`: Basic error logging
|
||||
|
||||
## Data Flow
|
||||
|
||||
1. **Document Ingestion**: File → Validation → Processing → Storage
|
||||
2. **Query Processing**: User Query → RAG Pipeline → Context Retrieval → Response Generation
|
||||
3. **Daily Monitoring**: Scheduled Trigger → Contract Scan → Conflict Detection → Report Generation
|
||||
|
||||
## Technology Stack
|
||||
|
||||
- **Framework**: FastAPI (async support, automatic docs)
|
||||
- **Vector DB**: ChromaDB (lightweight, easy setup)
|
||||
- **LLM Framework**: LangChain
|
||||
- **Container**: Docker + Docker Compose
|
||||
|
||||
## Implementation Priority
|
||||
|
||||
1. Document ingestion and indexing
|
||||
2. Basic RAG pipeline
|
||||
3. AI agent for daily reports
|
||||
4. Simple chatbot interface
|
||||
5. Document similarity function
|
||||
72
PLANNING/low_level_design.md
Normal file
72
PLANNING/low_level_design.md
Normal file
@@ -0,0 +1,72 @@
|
||||
# CLM System - Low Level Design
|
||||
|
||||
## Minimal Folder Structure (Python + Streamlit)
|
||||
|
||||
```
|
||||
clm-system/
|
||||
├── app.py # Main Streamlit chat interface
|
||||
├── requirements.txt # Dependencies
|
||||
├── config.py # Configuration settings
|
||||
├── data/ # Synthetic contract documents
|
||||
│ ├── contracts/ # PDF, DOCX, TXT files
|
||||
│ └── metadata/ # Document metadata
|
||||
├── src/
|
||||
│ ├── __init__.py
|
||||
│ ├── ingestion.py # Document processing & indexing
|
||||
│ ├── rag.py # RAG pipeline
|
||||
│ ├── agent.py # Manual trigger agent
|
||||
│ └── utils.py # Helper functions
|
||||
├── scripts/
|
||||
│ ├── manual_scan.py # Manual trigger script
|
||||
│ └── generate_reports.py # Report generation script
|
||||
└── tests/ # Basic tests
|
||||
└── test_ingestion.py
|
||||
```
|
||||
|
||||
## Setup Instructions
|
||||
Create the module with: `uv init clm-system --module`
|
||||
|
||||
## Core Components
|
||||
|
||||
### 1. Streamlit Interface (app.py)
|
||||
- Chat interface for contract queries
|
||||
- Document similarity search
|
||||
- Upload new contracts
|
||||
- Manual trigger button for daily scan
|
||||
|
||||
### 2. Document Ingestion (src/ingestion.py)
|
||||
- File validation and type detection
|
||||
- OCR for scanned PDFs
|
||||
- Text extraction from PDF/DOCX/TXT
|
||||
- LanceDB vector storage
|
||||
- Basic chunking strategy
|
||||
|
||||
### 3. RAG Pipeline (src/rag.py)
|
||||
- LangChain retrieval
|
||||
- Context-aware querying
|
||||
- Source citation (document name, page)
|
||||
- Embedding generation
|
||||
|
||||
### 4. Manual Agent (src/agent.py)
|
||||
- Manual trigger via script
|
||||
- Expiration date detection (30-day alert)
|
||||
- Conflict identification
|
||||
- Email report generation
|
||||
|
||||
### 5. Manual Triggers
|
||||
- scripts/manual_scan.py: Run daily scan
|
||||
- scripts/generate_reports.py: Generate reports
|
||||
- Both can be run via cron or manually
|
||||
|
||||
## Technology Stack
|
||||
- **Framework**: Streamlit (chat interface)
|
||||
- **Vector DB**: LanceDB (lightweight, local)
|
||||
- **LLM Framework**: LangChain
|
||||
- **File Processing**: PyPDF2, python-docx
|
||||
- **OCR**: pytesseract
|
||||
- **Email**: smtplib
|
||||
|
||||
## Data Flow
|
||||
1. **Ingestion**: File → Validation → Processing → LanceDB
|
||||
2. **Query**: User Input → RAG → Context Retrieval → Response
|
||||
3. **Manual Scan**: Trigger → Contract Scan → Analysis → Email Report
|
||||
138
PLANNING/smtp.md
Normal file
138
PLANNING/smtp.md
Normal file
@@ -0,0 +1,138 @@
|
||||
# Sendria SMTP Integration Guide
|
||||
|
||||
## Overview
|
||||
Sendria (formerly MailTrap) is a development SMTP server that catches emails and displays them in a web interface instead of sending them to real recipients. Perfect for development/testing environments.
|
||||
|
||||
## Installation
|
||||
|
||||
Sendria is a standalone SMTP server application, not a Python package. Install it separately:
|
||||
|
||||
**Using uv pip (recommended for Python environments):**
|
||||
```bash
|
||||
uv pip install sendria
|
||||
```
|
||||
|
||||
**Using Docker (most reliable):**
|
||||
```bash
|
||||
docker pull ghcr.io/mmbesar/sendria-container:latest
|
||||
docker run -d \
|
||||
--name sendria \
|
||||
-p 1025:1025 \
|
||||
-p 1080:1080 \
|
||||
ghcr.io/mmbesar/sendria-container:latest
|
||||
```
|
||||
|
||||
## Quick Start
|
||||
|
||||
1. **Start Sendria server:**
|
||||
```bash
|
||||
sendria --db mails.sqlite
|
||||
```
|
||||
|
||||
2. **Access the web interface:**
|
||||
- SMTP server: `smtp://127.0.0.1:1025`
|
||||
- Web interface: `http://127.0.0.1:1080`
|
||||
|
||||
## Configuration
|
||||
|
||||
### Update Environment Variables
|
||||
|
||||
Update your `.env` file to use Sendria:
|
||||
|
||||
```bash
|
||||
# Email Configuration for Sendria (development)
|
||||
EMAIL_SMTP_SERVER=127.0.0.1
|
||||
EMAIL_SMTP_PORT=1025
|
||||
EMAIL_USERNAME=
|
||||
EMAIL_PASSWORD=
|
||||
RECIPIENT_EMAIL=admin@example.com
|
||||
```
|
||||
|
||||
### Enable Email Sending in Agent
|
||||
|
||||
In `src/clm_system/agent.py`, modify the `send_email_report` method:
|
||||
|
||||
```python
|
||||
def send_email_report(self, expiring_contracts: list[ContractAlert], conflicts: list[ContractAlert]):
|
||||
"""Send email report with scan results"""
|
||||
try:
|
||||
msg = MIMEMultipart()
|
||||
msg['From'] = self.sender_email
|
||||
msg['To'] = self.recipient_email
|
||||
msg['Subject'] = f"CLM Daily Report - {datetime.now().strftime('%Y-%m-%d')}"
|
||||
|
||||
# Create email body
|
||||
body = self.create_email_body(expiring_contracts, conflicts)
|
||||
msg.attach(MIMEText(body, 'plain'))
|
||||
|
||||
# Send email via Sendria
|
||||
server = smtplib.SMTP(self.smtp_server, self.smtp_port)
|
||||
# No TLS or authentication needed for Sendria
|
||||
server.send_message(msg)
|
||||
server.quit()
|
||||
|
||||
logger.info("Email report sent via Sendria")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error sending email report: {e}")
|
||||
```
|
||||
|
||||
## Testing
|
||||
|
||||
1. **Start Sendria:**
|
||||
```bash
|
||||
sendria --db mails.sqlite
|
||||
```
|
||||
|
||||
2. **Run your CLM system:**
|
||||
```bash
|
||||
streamlit run app.py
|
||||
```
|
||||
|
||||
3. **Trigger a scan** or wait for scheduled scan
|
||||
|
||||
4. **Check captured emails** at `http://127.0.0.1:1080`
|
||||
|
||||
## Sendria Features
|
||||
|
||||
- **Email Catching**: Captures all emails sent to SMTP port 1025
|
||||
- **Web Interface**: View emails in browser at port 1080
|
||||
- **No Authentication**: Simple setup without credentials
|
||||
- **SQLite Storage**: Emails persist in `mails.sqlite`
|
||||
- **WebSocket Support**: Real-time email updates
|
||||
- **API Access**: RESTful API for programmatic access
|
||||
|
||||
## API Endpoints
|
||||
|
||||
- `GET /api/messages/` - List all emails
|
||||
- `GET /api/messages/{id}.json` - Email metadata
|
||||
- `GET /api/messages/{id}.plain` - Plain text content
|
||||
- `GET /api/messages/{id}.html` - HTML content
|
||||
- `GET /api/messages/{id}.eml` - Download as EML file
|
||||
|
||||
## Production vs Development
|
||||
|
||||
- **Development**: Use Sendria (catches emails locally)
|
||||
- **Production**: Use Gmail SMTP or other real SMTP service
|
||||
|
||||
**Important**: Sendria is only for development/testing. Never use it in production environments.
|
||||
|
||||
## Docker Alternative (Recommended)
|
||||
|
||||
```bash
|
||||
docker pull ghcr.io/mmbesar/sendria-container:latest
|
||||
docker run -d \
|
||||
--name sendria \
|
||||
-p 1025:1025 \
|
||||
-p 1080:1080 \
|
||||
ghcr.io/mmbesar/sendria-container:latest
|
||||
```
|
||||
|
||||
Docker is the recommended approach as it doesn't require system-wide Python package installation.
|
||||
|
||||
## Common Issues
|
||||
|
||||
1. **Port already in use**: Kill existing process on port 1025 or 1080
|
||||
2. **Can't see emails**: Check firewall settings and ensure ports are open
|
||||
3. **Emails not sending**: Verify SMTP settings in your `.env` file
|
||||
4. **Sendria not found**: Ensure it's installed with `uv pip install sendria`
|
||||
6
PLANNING/steps.md
Normal file
6
PLANNING/steps.md
Normal file
@@ -0,0 +1,6 @@
|
||||
Steps taken:
|
||||
- [x] Read and understand the task.
|
||||
- [x] Create a design and select suitable design pattern.
|
||||
- [x] use uv init package
|
||||
- [x] add dev deps for ruff pyright pytest
|
||||
- [x] never use pathlib or os.path directly always use importlib.resources and define resources in pyproject.toml
|
||||
295
PLANNING/streamlit_init_issue.md
Normal file
295
PLANNING/streamlit_init_issue.md
Normal file
@@ -0,0 +1,295 @@
|
||||
# Streamlit RAG Pipeline Initialization Issue
|
||||
|
||||
## Problem Statement
|
||||
When running the Streamlit app, sending "Hi" results in error: `AI models are not properly configured. Please check your API keys.`
|
||||
|
||||
### Root Cause Analysis
|
||||
1. **Embeddings returning None**: The `_initialize_embeddings()` method in `RAGPipeline` returns `None` when initialization fails
|
||||
2. **Silent failures**: Exceptions are caught but only logged as warnings, returning `None` instead of raising errors
|
||||
3. **Streamlit rerun behavior**: Each interaction causes a full script rerun, potentially reinitializing models
|
||||
4. **No persistence**: Models are not stored in `st.session_state`, causing repeated initialization attempts
|
||||
|
||||
### Current Behavior
|
||||
```python
|
||||
# Current problematic flow:
|
||||
RAGPipeline.__init__()
|
||||
→ _initialize_embeddings()
|
||||
→ try/except returns None on failure
|
||||
→ self.embeddings = None
|
||||
→ Query fails with generic error
|
||||
```
|
||||
|
||||
## Proposed Design Changes
|
||||
|
||||
### 1. Singleton Pattern with Session State
|
||||
**Problem**: Multiple RAGPipeline instances created on reruns
|
||||
**Solution**: Use Streamlit session state as singleton storage
|
||||
|
||||
```python
|
||||
# In app.py or streamlit_app.py
|
||||
def get_rag_pipeline():
|
||||
"""Get or create RAG pipeline with proper session state management"""
|
||||
if 'rag_pipeline' not in st.session_state:
|
||||
with st.spinner("Initializing AI models..."):
|
||||
pipeline = RAGPipeline()
|
||||
|
||||
# Validate initialization
|
||||
if pipeline.embeddings is None:
|
||||
st.error("Failed to initialize embedding model")
|
||||
st.stop()
|
||||
if pipeline.llm is None:
|
||||
st.error("Failed to initialize language model")
|
||||
st.stop()
|
||||
|
||||
st.session_state.rag_pipeline = pipeline
|
||||
st.success("AI models initialized successfully")
|
||||
|
||||
return st.session_state.rag_pipeline
|
||||
```
|
||||
|
||||
### 2. Lazy Initialization with Caching
|
||||
**Problem**: Models initialized in `__init__` even if not needed
|
||||
**Solution**: Use lazy properties with `@st.cache_resource`
|
||||
|
||||
```python
|
||||
class RAGPipeline:
|
||||
def __init__(self):
|
||||
self._embeddings = None
|
||||
self._llm = None
|
||||
self.db_path = "data/lancedb"
|
||||
self.db = lancedb.connect(self.db_path)
|
||||
|
||||
@property
|
||||
def embeddings(self):
|
||||
if self._embeddings is None:
|
||||
self._embeddings = self._get_or_create_embeddings()
|
||||
return self._embeddings
|
||||
|
||||
@property
|
||||
def llm(self):
|
||||
if self._llm is None:
|
||||
self._llm = self._get_or_create_llm()
|
||||
return self._llm
|
||||
|
||||
@st.cache_resource
|
||||
def _get_or_create_embeddings(_self):
|
||||
"""Cached embedding model creation"""
|
||||
return _self._initialize_embeddings()
|
||||
|
||||
@st.cache_resource
|
||||
def _get_or_create_llm(_self):
|
||||
"""Cached LLM creation"""
|
||||
return _self._initialize_llm()
|
||||
```
|
||||
|
||||
### 3. Explicit Error Handling
|
||||
**Problem**: Silent failures with `return None`
|
||||
**Solution**: Raise exceptions with clear messages
|
||||
|
||||
```python
|
||||
def _initialize_embeddings(self):
|
||||
"""Initialize embeddings with explicit error handling"""
|
||||
model_type = config.EMBEDDING_MODEL
|
||||
|
||||
try:
|
||||
if model_type == "google":
|
||||
if not config.GOOGLE_API_KEY:
|
||||
raise ValueError("GOOGLE_API_KEY is not set in environment variables")
|
||||
|
||||
# Try to import and initialize
|
||||
try:
|
||||
from langchain_google_genai import GoogleGenerativeAIEmbeddings
|
||||
except ImportError as e:
|
||||
raise ImportError(f"Failed to import GoogleGenerativeAIEmbeddings: {e}")
|
||||
|
||||
embeddings = GoogleGenerativeAIEmbeddings(
|
||||
model=config.GOOGLE_EMBEDDING_MODEL,
|
||||
google_api_key=config.GOOGLE_API_KEY
|
||||
)
|
||||
|
||||
# Test the embeddings work
|
||||
test_embedding = embeddings.embed_query("test")
|
||||
if not test_embedding:
|
||||
raise ValueError("Embeddings returned empty result for test query")
|
||||
|
||||
return embeddings
|
||||
|
||||
elif model_type == "openai":
|
||||
# Similar explicit handling for OpenAI
|
||||
pass
|
||||
|
||||
else:
|
||||
raise ValueError(f"Unsupported embedding model: {model_type}")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to initialize {model_type} embeddings: {str(e)}")
|
||||
# Re-raise with context
|
||||
raise RuntimeError(f"Embedding initialization failed: {str(e)}") from e
|
||||
```
|
||||
|
||||
### 4. Configuration Validation
|
||||
**Problem**: No upfront validation of configuration
|
||||
**Solution**: Add configuration validator
|
||||
|
||||
```python
|
||||
def validate_ai_config():
|
||||
"""Validate AI configuration before initialization"""
|
||||
errors = []
|
||||
|
||||
# Check embedding model configuration
|
||||
if config.EMBEDDING_MODEL == "google":
|
||||
if not config.GOOGLE_API_KEY:
|
||||
errors.append("GOOGLE_API_KEY not set for Google embedding model")
|
||||
if not config.GOOGLE_EMBEDDING_MODEL:
|
||||
errors.append("GOOGLE_EMBEDDING_MODEL not configured")
|
||||
|
||||
# Check LLM configuration
|
||||
if config.LLM_MODEL == "google":
|
||||
if not config.GOOGLE_API_KEY:
|
||||
errors.append("GOOGLE_API_KEY not set for Google LLM")
|
||||
if not config.GOOGLE_MODEL:
|
||||
errors.append("GOOGLE_MODEL not configured")
|
||||
|
||||
if errors:
|
||||
return False, errors
|
||||
return True, []
|
||||
|
||||
# Use in Streamlit app
|
||||
def initialize_app():
|
||||
valid, errors = validate_ai_config()
|
||||
if not valid:
|
||||
st.error("Configuration errors detected:")
|
||||
for error in errors:
|
||||
st.error(f"• {error}")
|
||||
st.stop()
|
||||
```
|
||||
|
||||
### 5. Model Factory Pattern
|
||||
**Problem**: Model initialization logic mixed with pipeline logic
|
||||
**Solution**: Separate model creation into factory
|
||||
|
||||
```python
|
||||
class ModelFactory:
|
||||
"""Factory for creating AI models with proper error handling"""
|
||||
|
||||
@staticmethod
|
||||
def create_embeddings(model_type: str):
|
||||
"""Create embedding model based on type"""
|
||||
creators = {
|
||||
"google": ModelFactory._create_google_embeddings,
|
||||
"openai": ModelFactory._create_openai_embeddings,
|
||||
"huggingface": ModelFactory._create_huggingface_embeddings
|
||||
}
|
||||
|
||||
creator = creators.get(model_type)
|
||||
if not creator:
|
||||
raise ValueError(f"Unknown embedding model type: {model_type}")
|
||||
|
||||
return creator()
|
||||
|
||||
@staticmethod
|
||||
def _create_google_embeddings():
|
||||
"""Create Google embeddings with validation"""
|
||||
if not config.GOOGLE_API_KEY:
|
||||
raise ValueError("GOOGLE_API_KEY not configured")
|
||||
|
||||
from langchain_google_genai import GoogleGenerativeAIEmbeddings
|
||||
|
||||
embeddings = GoogleGenerativeAIEmbeddings(
|
||||
model=config.GOOGLE_EMBEDDING_MODEL,
|
||||
google_api_key=config.GOOGLE_API_KEY
|
||||
)
|
||||
|
||||
# Validate it works
|
||||
try:
|
||||
test_result = embeddings.embed_query("test")
|
||||
if not test_result or len(test_result) == 0:
|
||||
raise ValueError("Embeddings test failed")
|
||||
except Exception as e:
|
||||
raise ValueError(f"Embeddings validation failed: {e}")
|
||||
|
||||
return embeddings
|
||||
```
|
||||
|
||||
## Implementation Steps
|
||||
|
||||
1. **Update RAGPipeline class** (`src/clm_system/rag.py`)
|
||||
- Remove direct initialization in `__init__`
|
||||
- Add lazy property getters
|
||||
- Implement explicit error handling
|
||||
- Remove all `return None` patterns
|
||||
|
||||
2. **Create ModelFactory** (`src/clm_system/model_factory.py`)
|
||||
- Implement factory methods for each model type
|
||||
- Add validation for each model
|
||||
- Include test queries to verify models work
|
||||
|
||||
3. **Update Streamlit app** (`debug_streamlit.py` or create new `app.py`)
|
||||
- Use session state for pipeline storage
|
||||
- Add configuration validation on startup
|
||||
- Show clear error messages to users
|
||||
- Add retry mechanism for transient failures
|
||||
|
||||
4. **Add configuration validator** (`src/clm_system/validators.py`)
|
||||
- Check all required environment variables
|
||||
- Validate API keys format (if applicable)
|
||||
- Test connectivity to services
|
||||
|
||||
5. **Update config.py**
|
||||
- Add helper methods for validation
|
||||
- Include default fallbacks where appropriate
|
||||
- Better error messages for missing values
|
||||
|
||||
## Testing Plan
|
||||
|
||||
1. **Unit Tests**
|
||||
- Test model initialization with valid config
|
||||
- Test model initialization with missing config
|
||||
- Test error handling and messages
|
||||
|
||||
2. **Integration Tests**
|
||||
- Test full pipeline initialization
|
||||
- Test Streamlit session state persistence
|
||||
- Test recovery from failures
|
||||
|
||||
3. **Manual Testing**
|
||||
- Start app with valid config → should work
|
||||
- Start app with missing API key → clear error
|
||||
- Send query after initialization → should respond
|
||||
- Refresh page → should maintain state
|
||||
|
||||
## Success Criteria
|
||||
|
||||
1. ✅ Clear error messages when configuration is invalid
|
||||
2. ✅ Models initialize once and persist in session state
|
||||
3. ✅ No silent failures (no `return None`)
|
||||
4. ✅ App handles "Hi" message successfully
|
||||
5. ✅ Page refreshes don't reinitialize models
|
||||
6. ✅ Failed initialization stops app with helpful message
|
||||
|
||||
## Rollback Plan
|
||||
|
||||
If changes cause issues:
|
||||
1. Revert to original code
|
||||
2. Add temporary workaround in Streamlit app:
|
||||
```python
|
||||
if 'rag_pipeline' not in st.session_state:
|
||||
# Force multiple initialization attempts
|
||||
for attempt in range(3):
|
||||
try:
|
||||
pipeline = RAGPipeline()
|
||||
if pipeline.embeddings and pipeline.llm:
|
||||
st.session_state.rag_pipeline = pipeline
|
||||
break
|
||||
except Exception as e:
|
||||
if attempt == 2:
|
||||
st.error(f"Failed after 3 attempts: {e}")
|
||||
st.stop()
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- Current code uses Google models (`EMBEDDING_MODEL=google`, `LLM_MODEL=google`)
|
||||
- Google API key is set in environment
|
||||
- Issue only occurs in Streamlit, CLI works fine
|
||||
- Root cause: Streamlit's execution model + silent failures in initialization
|
||||
Reference in New Issue
Block a user