# CLM System - Low Level Design

## Minimal Folder Structure (Python + Streamlit)

```
clm-system/
├── app.py                      # Main Streamlit chat interface
├── requirements.txt            # Dependencies
├── config.py                   # Configuration settings
├── data/                       # Synthetic contract documents
│   ├── contracts/              # PDF, DOCX, TXT files
│   └── metadata/               # Document metadata
├── src/
│   ├── __init__.py
│   ├── ingestion.py            # Document processing & indexing
│   ├── rag.py                  # RAG pipeline
│   ├── agent.py                # Manual trigger agent
│   └── utils.py                # Helper functions
├── scripts/
│   ├── manual_scan.py          # Manual trigger script
│   └── generate_reports.py     # Report generation script
└── tests/                      # Basic tests
    └── test_ingestion.py
```

## Setup Instructions

Create the module with: `uv init clm-system --module`

## Core Components

### 1. Streamlit Interface (app.py)

- Chat interface for contract queries
- Document similarity search
- Upload new contracts
- Manual trigger button for daily scan

### 2. Document Ingestion (src/ingestion.py)

- File validation and type detection
- OCR for scanned PDFs
- Text extraction from PDF/DOCX/TXT
- LanceDB vector storage
- Basic chunking strategy

### 3. RAG Pipeline (src/rag.py)

- LangChain retrieval
- Context-aware querying
- Source citation (document name, page)
- Embedding generation

### 4. Manual Agent (src/agent.py)

- Manual trigger via script
- Expiration date detection (30-day alert)
- Conflict identification
- Email report generation

### 5. Manual Triggers

- scripts/manual_scan.py: Run daily scan
- scripts/generate_reports.py: Generate reports
- Both can be run via cron or manually

## Technology Stack

- **Framework**: Streamlit (chat interface)
- **Vector DB**: LanceDB (lightweight, local)
- **LLM Framework**: LangChain
- **File Processing**: PyPDF2, python-docx
- **OCR**: pytesseract
- **Email**: smtplib

## Data Flow

1. **Ingestion**: File → Validation → Processing → LanceDB
2. **Query**: User Input → RAG → Context Retrieval → Response
3. **Manual Scan**: Trigger → Contract Scan → Analysis → Email Report
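
The manual scan flow could be sketched roughly as follows, using only the standard library. This is a minimal illustration, not the design's actual implementation: the `Contract` record, the `find_expiring` and `build_report` helpers, and the example addresses are all hypothetical names introduced here, and the expiration dates are assumed to be already extracted from the documents. The report is only built, not sent; handing it to `smtplib` would be a separate step.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from email.message import EmailMessage


@dataclass
class Contract:
    # Hypothetical record; the design does not define a metadata schema.
    name: str
    expires_on: date


def find_expiring(contracts, today, window_days=30):
    """Return contracts expiring within window_days of today, soonest first."""
    cutoff = today + timedelta(days=window_days)
    # Already-expired contracts (expires_on < today) are excluded here.
    expiring = [c for c in contracts if today <= c.expires_on <= cutoff]
    return sorted(expiring, key=lambda c: c.expires_on)


def build_report(expiring, today, sender, recipient):
    """Build (but do not send) a plain-text email report."""
    msg = EmailMessage()
    msg["Subject"] = f"CLM scan: {len(expiring)} contract(s) expiring soon"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [
        f"- {c.name}: expires {c.expires_on.isoformat()} "
        f"({(c.expires_on - today).days} days left)"
        for c in expiring
    ]
    body = "\n".join(lines) or "No contracts expiring in the alert window."
    msg.set_content(body)
    return msg


if __name__ == "__main__":
    today = date(2025, 1, 1)
    contracts = [
        Contract("vendor-a.pdf", date(2025, 1, 15)),   # inside the 30-day window
        Contract("lease-b.docx", date(2025, 6, 30)),   # too far out
        Contract("nda-c.txt", date(2024, 12, 1)),      # already expired
    ]
    report = build_report(
        find_expiring(contracts, today), today,
        "clm@example.com", "legal@example.com",
    )
    print(report["Subject"])
    print(report.get_content())
```

Passing `today` in explicitly, rather than calling `date.today()` inside the helper, keeps the scan deterministic and easy to cover in `tests/`.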