Initial implementation by kimi k2 0905

This commit is contained in:
2025-09-06 10:47:22 +05:30
commit bfb761238f
35 changed files with 8037 additions and 0 deletions

85
PLANNING/Task.md Normal file
View File

@@ -0,0 +1,85 @@
**Time Allocation:** approx 2 hours total
**AI Tools Encouraged:** Use any generative AI tools (ChatGPT, Claude, Copilot, etc.) to accelerate development
**Deliverables:** Automation pipelines, API integrations, testing frameworks, and deployment configurations
 **Contract Lifecycle Management (CLM) Automation**
The company is streamlining its Contract Lifecycle Management (CLM) process.  Currently, contracts are stored in a disorganized manner across various departments.  The goal is to create an intelligent platform that can:
- Index: Automatically ingest contracts from different sources.
- Understand: Extract key information (dates, parties, clauses).
- Alert: Identify potential issues (conflicts in contact info, approaching expiration dates).
- Provide Access: Make contract information easily accessible to authorized users via a chatbot and daily reports.
- Enable Insights: Detect similar contract versions (for version control and review).
The candidate should generate a synthetic dataset of 10-15 documents of varying types.  Include these document formats:
- PDFs (4-5):  Standard contracts, scanned contracts (requiring OCR)
- Word Documents (.docx) (3-4):  Draft contracts, amendments
- Text Files (.txt) (2-3):  Contract summaries, email correspondence related to contracts.
- Unstructured Text (2): e.g. meeting notes regarding a contract.  These should be purposefully less structured to test the candidate's ability to handle complexity.
Within the documents, there should be:
- Variations:  Several versions of the same contract with minor changes.
- Conflicts:  Deliberately include conflicting information (e.g., different addresses for the same company, different expiration dates) across different documents.
- Key Dates: Include contract creation dates, renewal dates, termination dates, and potentially clauses with specific effective dates.
- Metadata:  Some documents should have existing metadata (e.g., contract name, department) to test how the candidate integrates metadata into the pipeline.  Others should not. 
The candidate should build a system that can:
- Document Ingestion & Indexing:
- Load documents from a designated folder (simulating an incoming source).
- Use a suitable vector database (e.g., ChromaDB, Pinecone) to store embeddings.  Justify the database choice in a brief comment/readme.
- Implement basic chunking strategy.
- RAG Pipeline:
- Create a RAG pipeline using Langchain or a similar framework.  The pipeline should retrieve relevant document chunks based on user queries.
- AI Agent (Daily Report Generation):
- Develop an AI agent using Langchain Agents (or similar) that runs daily.
- The agent should automatically:
- Identify approaching contract expiration dates (within the next 30 days).
- Detect conflicting information (e.g., different addresses for the same company in different contracts).  The agent must describe what the conflict is and where it is (document names).
- Summarize the findings in an email report to a predefined email address (provide a test email address).  The email should be formatted clearly and concisely.
- Chatbot Interface:
- Create a simple chatbot interface (e.g., using Streamlit, Gradio, or a basic Flask app).
- When a user asks a question about a contract, the chatbot should:
- Use the RAG pipeline to retrieve relevant document chunks.
- Provide the AI answer to the user.
- Clearly cite the source documents used to generate the answer (e.g., document name and page number).  This is crucial.
- Document Similarity:
- Implement a function to find similar documents based on semantic similarity (using embedding similarity).  The user should be able to input a document name and receive a list of similar documents.
- Error Handling & Logging: Implement basic error handling and logging to ensure the system's reliability.
MCP Server Integration (Bonus): If the candidate has time, ask them to describe how they would integrate the RAG pipeline with an existing MCP server (e.g., using REST APIs). This doesn't necessarily require full implementation but demonstrating understanding of the process.
Success Criteria & Evaluation
- Functionality :
- Document Ingestion & Indexing:  Does the system load and index the documents correctly?  Are embeddings generated? (10%)
- RAG Pipeline: Does the RAG pipeline retrieve relevant information based on user queries?  (15%)
- AI Agent: Does the agent run daily and generate accurate reports with detected conflicts and approaching expiration dates? (15%)
- Chatbot Interface: Does the chatbot provide answers and cite sources correctly? (10%)
- Code Quality & Design:
- Readability: Is the code well-formatted and easy to understand?
- Modularity: Is the code organized into logical modules?
- Documentation: Is the code adequately documented?
- Error Handling: Does the code handle errors gracefully?
- Reasoning & Approach:
- Framework Choices:  Were appropriate frameworks and tools selected? Justification of choices is important.
- Problem Solving:  Did the candidate demonstrate a logical approach to solving the problem?
- Scalability:  Did the candidate consider the scalability of the solution? (e.g., vector database choice, chunking strategy)

View File

@@ -0,0 +1,67 @@
# Asyncio Event Loop Issue with Streamlit
## Problem
Error: `There is no current event loop in thread 'ScriptRunner.scriptThread'`
This occurs because:
1. Google's langchain integration uses asyncio internally
2. Streamlit runs scripts in a separate thread (ScriptRunner.scriptThread)
3. This thread doesn't have an asyncio event loop by default
4. When `embed_query()` is called, it tries to use async operations but fails
## Solution
### Option 1: Create Event Loop for Thread (Recommended)
Add event loop handling in the model factory:
```python
import asyncio
import threading
def _create_google_embeddings() -> GoogleGenerativeAIEmbeddings:
"""Create Google embeddings with validation"""
if not config.GOOGLE_API_KEY:
raise ValueError("GOOGLE_API_KEY not configured")
# Ensure event loop exists for current thread
try:
loop = asyncio.get_event_loop()
except RuntimeError:
# Create new event loop for this thread
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
embeddings = GoogleGenerativeAIEmbeddings(
model=config.GOOGLE_EMBEDDING_MODEL,
google_api_key=config.GOOGLE_API_KEY
)
# Rest of validation...
```
### Option 2: Use nest_asyncio (Simple but less clean)
Install and apply nest_asyncio at app startup:
```python
import nest_asyncio
nest_asyncio.apply()
```
### Option 3: Synchronous Wrapper
Create a synchronous wrapper for async operations:
```python
def sync_embed_query(embeddings, text):
"""Synchronous wrapper for async embed_query"""
try:
loop = asyncio.get_event_loop()
except RuntimeError:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
return loop.run_until_complete(embeddings.aembed_query(text))
```
## Recommended Fix
Update `model_factory.py` in the `_create_google_embeddings` method to handle the event loop properly.

53
PLANNING/design.md Normal file
View File

@@ -0,0 +1,53 @@
# CLM System Architecture Design
## Design Patterns
### 1. Monolithic Architecture
Single FastAPI application with modular components:
- **Document Ingestion Module**: Handles multiple file formats (PDF, DOCX, TXT)
- **RAG Module**: Manages vector embeddings and retrieval
- **AI Agent Module**: Daily contract monitoring and reporting
- **Chatbot Module**: User interface for contract queries
### 2. Direct File Operations
- Simple utility functions for file I/O
- Direct file system operations for document storage
- No abstraction layer needed for this scope
### 3. Direct File Processing
- Simple file type detection and processing functions
- Direct embedding generation using selected model
### 4. Strategy Pattern
- `ChunkingStrategy`: Basic fixed-size chunking
- `EmbeddingModel`: Single model (OpenAI or local)
### 5. Chain of Responsibility
- Document processing pipeline:
1. `FileValidator` → 2. `OCRProcessor` → 3. `TextExtractor` → 4. `Chunker` → 5. `Embedder` → 6. `VectorStore`
### 6. Singleton Pattern
- `ConfigurationManager`: Global config access
- `VectorDatabaseConnection`: Single connection
- `Logger`: Basic error logging
## Data Flow
1. **Document Ingestion**: File → Validation → Processing → Storage
2. **Query Processing**: User Query → RAG Pipeline → Context Retrieval → Response Generation
3. **Daily Monitoring**: Scheduled Trigger → Contract Scan → Conflict Detection → Report Generation
## Technology Stack
- **Framework**: FastAPI (async support, automatic docs)
- **Vector DB**: ChromaDB (lightweight, easy setup)
- **LLM Framework**: LangChain
- **Container**: Docker + Docker Compose
## Implementation Priority
1. Document ingestion and indexing
2. Basic RAG pipeline
3. AI agent for daily reports
4. Simple chatbot interface
5. Document similarity function

View File

@@ -0,0 +1,72 @@
# CLM System - Low Level Design
## Minimal Folder Structure (Python + Streamlit)
```
clm-system/
├── app.py # Main Streamlit chat interface
├── requirements.txt # Dependencies
├── config.py # Configuration settings
├── data/ # Synthetic contract documents
│ ├── contracts/ # PDF, DOCX, TXT files
│ └── metadata/ # Document metadata
├── src/
│ ├── __init__.py
│ ├── ingestion.py # Document processing & indexing
│ ├── rag.py # RAG pipeline
│ ├── agent.py # Manual trigger agent
│ └── utils.py # Helper functions
├── scripts/
│ ├── manual_scan.py # Manual trigger script
│ └── generate_reports.py # Report generation script
└── tests/ # Basic tests
└── test_ingestion.py
```
## Setup Instructions
Create the module with: `uv init clm-system --module`
## Core Components
### 1. Streamlit Interface (app.py)
- Chat interface for contract queries
- Document similarity search
- Upload new contracts
- Manual trigger button for daily scan
### 2. Document Ingestion (src/ingestion.py)
- File validation and type detection
- OCR for scanned PDFs
- Text extraction from PDF/DOCX/TXT
- LanceDB vector storage
- Basic chunking strategy
### 3. RAG Pipeline (src/rag.py)
- LangChain retrieval
- Context-aware querying
- Source citation (document name, page)
- Embedding generation
### 4. Manual Agent (src/agent.py)
- Manual trigger via script
- Expiration date detection (30-day alert)
- Conflict identification
- Email report generation
### 5. Manual Triggers
- scripts/manual_scan.py: Run daily scan
- scripts/generate_reports.py: Generate reports
- Both can be run via cron or manually
## Technology Stack
- **Framework**: Streamlit (chat interface)
- **Vector DB**: LanceDB (lightweight, local)
- **LLM Framework**: LangChain
- **File Processing**: PyPDF2, python-docx
- **OCR**: pytesseract
- **Email**: smtplib
## Data Flow
1. **Ingestion**: File → Validation → Processing → LanceDB
2. **Query**: User Input → RAG → Context Retrieval → Response
3. **Manual Scan**: Trigger → Contract Scan → Analysis → Email Report

138
PLANNING/smtp.md Normal file
View File

@@ -0,0 +1,138 @@
# Sendria SMTP Integration Guide
## Overview
Sendria (formerly MailTrap) is a development SMTP server that catches emails and displays them in a web interface instead of sending them to real recipients. Perfect for development/testing environments.
## Installation
Sendria is a standalone SMTP server application, not a Python package. Install it separately:
**Using uv pip (recommended for Python environments):**
```bash
uv pip install sendria
```
**Using Docker (most reliable):**
```bash
docker pull ghcr.io/mmbesar/sendria-container:latest
docker run -d \
--name sendria \
-p 1025:1025 \
-p 1080:1080 \
ghcr.io/mmbesar/sendria-container:latest
```
## Quick Start
1. **Start Sendria server:**
```bash
sendria --db mails.sqlite
```
2. **Access the web interface:**
- SMTP server: `smtp://127.0.0.1:1025`
- Web interface: `http://127.0.0.1:1080`
## Configuration
### Update Environment Variables
Update your `.env` file to use Sendria:
```bash
# Email Configuration for Sendria (development)
EMAIL_SMTP_SERVER=127.0.0.1
EMAIL_SMTP_PORT=1025
EMAIL_USERNAME=
EMAIL_PASSWORD=
RECIPIENT_EMAIL=admin@example.com
```
### Enable Email Sending in Agent
In `src/clm_system/agent.py`, modify the `send_email_report` method:
```python
def send_email_report(self, expiring_contracts: list[ContractAlert], conflicts: list[ContractAlert]):
"""Send email report with scan results"""
try:
msg = MIMEMultipart()
msg['From'] = self.sender_email
msg['To'] = self.recipient_email
msg['Subject'] = f"CLM Daily Report - {datetime.now().strftime('%Y-%m-%d')}"
# Create email body
body = self.create_email_body(expiring_contracts, conflicts)
msg.attach(MIMEText(body, 'plain'))
# Send email via Sendria
server = smtplib.SMTP(self.smtp_server, self.smtp_port)
# No TLS or authentication needed for Sendria
server.send_message(msg)
server.quit()
logger.info("Email report sent via Sendria")
except Exception as e:
logger.error(f"Error sending email report: {e}")
```
## Testing
1. **Start Sendria:**
```bash
sendria --db mails.sqlite
```
2. **Run your CLM system:**
```bash
streamlit run app.py
```
3. **Trigger a scan** or wait for scheduled scan
4. **Check captured emails** at `http://127.0.0.1:1080`
## Sendria Features
- **Email Catching**: Captures all emails sent to SMTP port 1025
- **Web Interface**: View emails in browser at port 1080
- **No Authentication**: Simple setup without credentials
- **SQLite Storage**: Emails persist in `mails.sqlite`
- **WebSocket Support**: Real-time email updates
- **API Access**: RESTful API for programmatic access
## API Endpoints
- `GET /api/messages/` - List all emails
- `GET /api/messages/{id}.json` - Email metadata
- `GET /api/messages/{id}.plain` - Plain text content
- `GET /api/messages/{id}.html` - HTML content
- `GET /api/messages/{id}.eml` - Download as EML file
## Production vs Development
- **Development**: Use Sendria (catches emails locally)
- **Production**: Use Gmail SMTP or other real SMTP service
**Important**: Sendria is only for development/testing. Never use it in production environments.
## Docker Alternative (Recommended)
```bash
docker pull ghcr.io/mmbesar/sendria-container:latest
docker run -d \
--name sendria \
-p 1025:1025 \
-p 1080:1080 \
ghcr.io/mmbesar/sendria-container:latest
```
Docker is the recommended approach as it doesn't require system-wide Python package installation.
## Common Issues
1. **Port already in use**: Kill existing process on port 1025 or 1080
2. **Can't see emails**: Check firewall settings and ensure ports are open
3. **Emails not sending**: Verify SMTP settings in your `.env` file
4. **Sendria not found**: Ensure it's installed with `uv pip install sendria`

6
PLANNING/steps.md Normal file
View File

@@ -0,0 +1,6 @@
Steps taken:
- [x] Read and understand the task.
- [x] Create a design and select suitable design pattern.
- [x] use uv init package
- [x] add dev deps for ruff pyright pytest
- [x] never use pathlib or os.path directly always use importlib.resources and define resources in pyproject.toml

View File

@@ -0,0 +1,295 @@
# Streamlit RAG Pipeline Initialization Issue
## Problem Statement
When running the Streamlit app, sending "Hi" results in error: `AI models are not properly configured. Please check your API keys.`
### Root Cause Analysis
1. **Embeddings returning None**: The `_initialize_embeddings()` method in `RAGPipeline` returns `None` when initialization fails
2. **Silent failures**: Exceptions are caught but only logged as warnings, returning `None` instead of raising errors
3. **Streamlit rerun behavior**: Each interaction causes a full script rerun, potentially reinitializing models
4. **No persistence**: Models are not stored in `st.session_state`, causing repeated initialization attempts
### Current Behavior
```python
# Current problematic flow:
RAGPipeline.__init__()
_initialize_embeddings()
try/except returns None on failure
self.embeddings = None
Query fails with generic error
```
## Proposed Design Changes
### 1. Singleton Pattern with Session State
**Problem**: Multiple RAGPipeline instances created on reruns
**Solution**: Use Streamlit session state as singleton storage
```python
# In app.py or streamlit_app.py
def get_rag_pipeline():
"""Get or create RAG pipeline with proper session state management"""
if 'rag_pipeline' not in st.session_state:
with st.spinner("Initializing AI models..."):
pipeline = RAGPipeline()
# Validate initialization
if pipeline.embeddings is None:
st.error("Failed to initialize embedding model")
st.stop()
if pipeline.llm is None:
st.error("Failed to initialize language model")
st.stop()
st.session_state.rag_pipeline = pipeline
st.success("AI models initialized successfully")
return st.session_state.rag_pipeline
```
### 2. Lazy Initialization with Caching
**Problem**: Models initialized in `__init__` even if not needed
**Solution**: Use lazy properties with `@st.cache_resource`
```python
class RAGPipeline:
def __init__(self):
self._embeddings = None
self._llm = None
self.db_path = "data/lancedb"
self.db = lancedb.connect(self.db_path)
@property
def embeddings(self):
if self._embeddings is None:
self._embeddings = self._get_or_create_embeddings()
return self._embeddings
@property
def llm(self):
if self._llm is None:
self._llm = self._get_or_create_llm()
return self._llm
@st.cache_resource
def _get_or_create_embeddings(_self):
"""Cached embedding model creation"""
return _self._initialize_embeddings()
@st.cache_resource
def _get_or_create_llm(_self):
"""Cached LLM creation"""
return _self._initialize_llm()
```
### 3. Explicit Error Handling
**Problem**: Silent failures with `return None`
**Solution**: Raise exceptions with clear messages
```python
def _initialize_embeddings(self):
"""Initialize embeddings with explicit error handling"""
model_type = config.EMBEDDING_MODEL
try:
if model_type == "google":
if not config.GOOGLE_API_KEY:
raise ValueError("GOOGLE_API_KEY is not set in environment variables")
# Try to import and initialize
try:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
except ImportError as e:
raise ImportError(f"Failed to import GoogleGenerativeAIEmbeddings: {e}")
embeddings = GoogleGenerativeAIEmbeddings(
model=config.GOOGLE_EMBEDDING_MODEL,
google_api_key=config.GOOGLE_API_KEY
)
# Test the embeddings work
test_embedding = embeddings.embed_query("test")
if not test_embedding:
raise ValueError("Embeddings returned empty result for test query")
return embeddings
elif model_type == "openai":
# Similar explicit handling for OpenAI
pass
else:
raise ValueError(f"Unsupported embedding model: {model_type}")
except Exception as e:
logger.error(f"Failed to initialize {model_type} embeddings: {str(e)}")
# Re-raise with context
raise RuntimeError(f"Embedding initialization failed: {str(e)}") from e
```
### 4. Configuration Validation
**Problem**: No upfront validation of configuration
**Solution**: Add configuration validator
```python
def validate_ai_config():
"""Validate AI configuration before initialization"""
errors = []
# Check embedding model configuration
if config.EMBEDDING_MODEL == "google":
if not config.GOOGLE_API_KEY:
errors.append("GOOGLE_API_KEY not set for Google embedding model")
if not config.GOOGLE_EMBEDDING_MODEL:
errors.append("GOOGLE_EMBEDDING_MODEL not configured")
# Check LLM configuration
if config.LLM_MODEL == "google":
if not config.GOOGLE_API_KEY:
errors.append("GOOGLE_API_KEY not set for Google LLM")
if not config.GOOGLE_MODEL:
errors.append("GOOGLE_MODEL not configured")
if errors:
return False, errors
return True, []
# Use in Streamlit app
def initialize_app():
valid, errors = validate_ai_config()
if not valid:
st.error("Configuration errors detected:")
for error in errors:
st.error(f"• {error}")
st.stop()
```
### 5. Model Factory Pattern
**Problem**: Model initialization logic mixed with pipeline logic
**Solution**: Separate model creation into factory
```python
class ModelFactory:
"""Factory for creating AI models with proper error handling"""
@staticmethod
def create_embeddings(model_type: str):
"""Create embedding model based on type"""
creators = {
"google": ModelFactory._create_google_embeddings,
"openai": ModelFactory._create_openai_embeddings,
"huggingface": ModelFactory._create_huggingface_embeddings
}
creator = creators.get(model_type)
if not creator:
raise ValueError(f"Unknown embedding model type: {model_type}")
return creator()
@staticmethod
def _create_google_embeddings():
"""Create Google embeddings with validation"""
if not config.GOOGLE_API_KEY:
raise ValueError("GOOGLE_API_KEY not configured")
from langchain_google_genai import GoogleGenerativeAIEmbeddings
embeddings = GoogleGenerativeAIEmbeddings(
model=config.GOOGLE_EMBEDDING_MODEL,
google_api_key=config.GOOGLE_API_KEY
)
# Validate it works
try:
test_result = embeddings.embed_query("test")
if not test_result or len(test_result) == 0:
raise ValueError("Embeddings test failed")
except Exception as e:
raise ValueError(f"Embeddings validation failed: {e}")
return embeddings
```
## Implementation Steps
1. **Update RAGPipeline class** (`src/clm_system/rag.py`)
- Remove direct initialization in `__init__`
- Add lazy property getters
- Implement explicit error handling
- Remove all `return None` patterns
2. **Create ModelFactory** (`src/clm_system/model_factory.py`)
- Implement factory methods for each model type
- Add validation for each model
- Include test queries to verify models work
3. **Update Streamlit app** (`debug_streamlit.py` or create new `app.py`)
- Use session state for pipeline storage
- Add configuration validation on startup
- Show clear error messages to users
- Add retry mechanism for transient failures
4. **Add configuration validator** (`src/clm_system/validators.py`)
- Check all required environment variables
- Validate API keys format (if applicable)
- Test connectivity to services
5. **Update config.py**
- Add helper methods for validation
- Include default fallbacks where appropriate
- Better error messages for missing values
## Testing Plan
1. **Unit Tests**
- Test model initialization with valid config
- Test model initialization with missing config
- Test error handling and messages
2. **Integration Tests**
- Test full pipeline initialization
- Test Streamlit session state persistence
- Test recovery from failures
3. **Manual Testing**
- Start app with valid config → should work
- Start app with missing API key → clear error
- Send query after initialization → should respond
- Refresh page → should maintain state
## Success Criteria
1. ✅ Clear error messages when configuration is invalid
2. ✅ Models initialize once and persist in session state
3. ✅ No silent failures (no `return None`)
4. ✅ App handles "Hi" message successfully
5. ✅ Page refreshes don't reinitialize models
6. ✅ Failed initialization stops app with helpful message
## Rollback Plan
If changes cause issues:
1. Revert to original code
2. Add temporary workaround in Streamlit app:
```python
if 'rag_pipeline' not in st.session_state:
# Force multiple initialization attempts
for attempt in range(3):
try:
pipeline = RAGPipeline()
if pipeline.embeddings and pipeline.llm:
st.session_state.rag_pipeline = pipeline
break
except Exception as e:
if attempt == 2:
st.error(f"Failed after 3 attempts: {e}")
st.stop()
```
## Notes
- Current code uses Google models (`EMBEDDING_MODEL=google`, `LLM_MODEL=google`)
- Google API key is set in environment
- Issue only occurs in Streamlit, CLI works fine
- Root cause: Streamlit's execution model + silent failures in initialization