clm-system/PLANNING/low_level_design.md

# CLM System - Low Level Design

## Minimal Folder Structure (Python + Streamlit)

```
clm-system/
├── app.py                     # Main Streamlit chat interface
├── requirements.txt           # Dependencies
├── config.py                  # Configuration settings
├── data/                      # Synthetic contract documents
│   ├── contracts/             # PDF, DOCX, TXT files
│   └── metadata/              # Document metadata
├── src/
│   ├── __init__.py
│   ├── ingestion.py           # Document processing & indexing
│   ├── rag.py                 # RAG pipeline
│   ├── agent.py               # Manual trigger agent
│   └── utils.py               # Helper functions
├── scripts/
│   ├── manual_scan.py         # Manual trigger script
│   └── generate_reports.py    # Report generation script
└── tests/                     # Basic tests
    └── test_ingestion.py
```

## Setup Instructions
Create the project with: `uv init clm-system`

## Core Components

### 1. Streamlit Interface (app.py)
- Chat interface for contract queries
- Document similarity search
- Upload new contracts
- Manual trigger button for daily scan

### 2. Document Ingestion (src/ingestion.py)
- File validation and type detection
- OCR for scanned PDFs
- Text extraction from PDF/DOCX/TXT
- LanceDB vector storage
- Basic chunking strategy
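
The validation and chunking steps above could look roughly like the sketch below. It is a minimal illustration, not the project's actual implementation: the function names `detect_file_type` and `chunk_text`, the extension whitelist, and the default chunk size and overlap are all assumptions chosen for the example. OCR, text extraction, and the LanceDB write are left out.

```python
from pathlib import Path

# Assumed whitelist, based on the file types listed for data/contracts/.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def detect_file_type(path: str) -> str:
    """Return the lowercase extension without the dot; raise for unsupported files."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return suffix.lstrip(".")


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("Require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

Character windows are the simplest possible strategy. Splitting on paragraph or clause boundaries would likely suit contracts better, at the cost of more parsing logic.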

### 3. RAG Pipeline (src/rag.py)
- LangChain retrieval
- Context-aware querying
- Source citation (document name, page)
- Embedding generation
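
To make the source-citation requirement concrete, here is a small sketch of how retrieved chunks might be assembled into a prompt that carries document name and page. Retrieval itself (LangChain and LanceDB) is not shown. `RetrievedChunk`, `build_prompt`, and `format_citations` are hypothetical names, and the prompt wording is only an assumption.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """One retrieval hit; field names are assumptions for this sketch."""
    document: str
    page: int
    text: str


def build_prompt(question: str, chunks: list[RetrievedChunk]) -> str:
    """Number each chunk and label it with its source so the answer can cite it."""
    context = "\n\n".join(
        f"[{i}] ({c.document}, p. {c.page})\n{c.text}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below. Cite sources by their [number].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def format_citations(chunks: list[RetrievedChunk]) -> list[str]:
    """Return unique 'document, p. N' labels in first-seen order."""
    seen: list[str] = []
    for c in chunks:
        label = f"{c.document}, p. {c.page}"
        if label not in seen:
            seen.append(label)
    return seen
```

Keeping the page number on every chunk at ingestion time is what makes this possible, so the metadata has to be stored alongside the vectors.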

### 4. Manual Agent (src/agent.py)
- Manual trigger via script
- Expiration date detection (alert for contracts expiring within 30 days)
- Conflict identification
- Email report generation
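
The 30-day expiration check can be sketched as a simple date-window filter. This assumes expiration dates have already been extracted into a mapping of contract name to `date`. The function name `expiring_contracts` and its parameters are illustrative, and skipping already-expired contracts is a design assumption rather than a stated requirement.

```python
from datetime import date, timedelta


def expiring_contracts(
    contracts: dict[str, date],
    today: date,
    window_days: int = 30,
) -> list[tuple[str, int]]:
    """Return (name, days_left) for contracts expiring within the window, soonest first.

    Contracts that expired before `today` are skipped (an assumption of this sketch).
    """
    cutoff = today + timedelta(days=window_days)
    hits = [
        (name, (expires - today).days)
        for name, expires in contracts.items()
        if today <= expires <= cutoff
    ]
    return sorted(hits, key=lambda item: item[1])
```

Passing `today` explicitly, instead of calling `date.today()` inside the function, keeps the check deterministic and easy to test.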

### 5. Manual Triggers
- scripts/manual_scan.py: Run daily scan
- scripts/generate_reports.py: Generate reports
- Both can be run via cron or manually

## Technology Stack
- **Framework**: Streamlit (chat interface)
- **Vector DB**: LanceDB (lightweight, local)
- **LLM Framework**: LangChain
- **File Processing**: PyPDF2 (or its maintained successor, pypdf), python-docx
- **OCR**: pytesseract
- **Email**: smtplib

## Data Flow
1. **Ingestion**: File → Validation → Processing → LanceDB
2. **Query**: User Input → RAG → Context Retrieval → Response
3. **Manual Scan**: Trigger → Contract Scan → Analysis → Email Report
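
The last step of the manual scan flow, turning analysis results into an email report, might be sketched with the standard library as follows. The function names, the subject line format, and the shape of the `expiring` input (a list of name and days-left pairs) are assumptions. `send_report` assumes a server that accepts STARTTLS on port 587, and most real servers would also require a login step.

```python
import smtplib
from email.message import EmailMessage


def build_report_email(
    sender: str,
    recipient: str,
    expiring: list[tuple[str, int]],
) -> EmailMessage:
    """Compose a plain-text report listing contracts that expire soon."""
    msg = EmailMessage()
    msg["Subject"] = f"Contract scan: {len(expiring)} contract(s) expiring soon"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"- {name}: expires in {days} days" for name, days in expiring]
    msg.set_content("\n".join(lines) if lines else "No contracts are expiring soon.")
    return msg


def send_report(msg: EmailMessage, host: str, port: int = 587) -> None:
    """Send the message over SMTP with STARTTLS (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        # A real deployment would likely call smtp.login(...) here.
        smtp.send_message(msg)
```

Separating message construction from sending lets the report content be tested without a mail server.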