# CLM System - Low Level Design
## Minimal Folder Structure (Python + Streamlit)
```
clm-system/
├── app.py                      # Main Streamlit chat interface
├── requirements.txt            # Dependencies
├── config.py                   # Configuration settings
├── data/                       # Synthetic contract documents
│   ├── contracts/              # PDF, DOCX, TXT files
│   └── metadata/               # Document metadata
├── src/
│   ├── __init__.py
│   ├── ingestion.py            # Document processing & indexing
│   ├── rag.py                  # RAG pipeline
│   ├── agent.py                # Manual trigger agent
│   └── utils.py                # Helper functions
├── scripts/
│   ├── manual_scan.py          # Manual trigger script
│   └── generate_reports.py     # Report generation script
└── tests/                      # Basic tests
    └── test_ingestion.py
```
## Setup Instructions
Create the module with: `uv init clm-system --module`
## Core Components
### 1. Streamlit Interface (app.py)
- Chat interface for contract queries
- Document similarity search
- Upload new contracts
- Manual trigger button for daily scan
### 2. Document Ingestion (src/ingestion.py)
- File validation and type detection
- OCR for scanned PDFs
- Text extraction from PDF/DOCX/TXT
- LanceDB vector storage
- Basic chunking strategy
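
The validation and chunking steps could look roughly like the sketch below. It is a minimal illustration and not the project's actual `src/ingestion.py`: the function names `detect_file_type` and `chunk_text`, the extension-based detection, and the default chunk size and overlap are all assumptions chosen for the example.

```python
from pathlib import Path

# Assumed set of supported types, taken from the PDF/DOCX/TXT list above.
SUPPORTED_EXTENSIONS = {".pdf": "pdf", ".docx": "docx", ".txt": "txt"}


def detect_file_type(path: str) -> str:
    """Return the file type for a supported extension, else raise ValueError."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return SUPPORTED_EXTENSIONS[suffix]


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("Require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Character-based chunks with a small overlap are about the simplest strategy that keeps a clause from being cut cleanly in half at a boundary. Checking only the extension is a weak form of validation; inspecting file contents would be stricter.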
### 3. RAG Pipeline (src/rag.py)
- LangChain retrieval
- Context-aware querying
- Source citation (document name, page)
- Embedding generation
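
To make the source-citation requirement concrete, here is one possible way to carry the document name and page alongside each retrieved chunk. The `RetrievedChunk` dataclass and the helper names are hypothetical, and the sketch uses only the standard library rather than LangChain's own document types.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """A retrieved passage plus the metadata needed to cite it (assumed shape)."""
    text: str
    document: str
    page: int


def format_citation(chunk: RetrievedChunk) -> str:
    """Render a citation such as '[nda.pdf, p. 3]'."""
    return f"[{chunk.document}, p. {chunk.page}]"


def build_context(chunks: list[RetrievedChunk]) -> str:
    """Join retrieved chunks into one context string, each tagged with its source."""
    return "\n\n".join(f"{format_citation(c)}\n{c.text}" for c in chunks)


def unique_sources(chunks: list[RetrievedChunk]) -> list[str]:
    """List citations in first-seen order without duplicates."""
    return list(dict.fromkeys(format_citation(c) for c in chunks))
```

Tagging each passage in the context lets the model quote the tag back, and `unique_sources` gives the interface a short source list to show under the answer.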
### 4. Manual Agent (src/agent.py)
- Manual trigger via script
- Expiration date detection (30-day alert)
- Conflict identification
- Email report generation
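
The 30-day expiration alert reduces to a date-window check once expiry dates have been extracted. The sketch below assumes the dates are already available as a mapping from contract name to `date`; the function name `find_expiring` and the decision to skip contracts that have already expired are illustrative choices, not part of the design above.

```python
from datetime import date, timedelta

ALERT_WINDOW_DAYS = 30  # the 30-day alert window named in the design


def find_expiring(
    contracts: dict[str, date],
    today: date,
    window_days: int = ALERT_WINDOW_DAYS,
) -> list[tuple[str, int]]:
    """Return (contract name, days remaining) for contracts expiring in the window.

    Contracts that expired before `today` are skipped (an assumption); results
    are sorted so the most urgent contract comes first.
    """
    cutoff = today + timedelta(days=window_days)
    expiring = [
        (name, (expiry - today).days)
        for name, expiry in contracts.items()
        if today <= expiry <= cutoff
    ]
    return sorted(expiring, key=lambda item: item[1])
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check deterministic and easy to cover in `tests/`.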
### 5. Manual Triggers
- `scripts/manual_scan.py`: Run the daily scan
- `scripts/generate_reports.py`: Generate reports
- Both can be run manually or scheduled with cron
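
A script that works both by hand and under cron mostly needs a small command-line entry point that returns an exit code. The following is a hypothetical skeleton for `scripts/manual_scan.py`; the flag names (`--contracts-dir`, `--alert-days`, `--dry-run`) are assumptions, and the real scan logic from `src/agent.py` is replaced by a print statement.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Define the (assumed) command-line options for a manual scan."""
    parser = argparse.ArgumentParser(description="Run the contract scan once.")
    parser.add_argument("--contracts-dir", default="data/contracts")
    parser.add_argument("--alert-days", type=int, default=30)
    parser.add_argument("--dry-run", action="store_true",
                        help="Analyse contracts but do not send the email report")
    return parser


def main(argv: Optional[list[str]] = None) -> int:
    """Parse arguments and run one scan; return a process exit code."""
    args = build_parser().parse_args(argv)
    # Placeholder: the real scan would be delegated to src/agent.py.
    mode = "dry run" if args.dry_run else "live"
    print(f"Scanning {args.contracts_dir} ({mode}, {args.alert_days}-day window)")
    return 0

# In the actual script, finish with: raise SystemExit(main())
```

Returning an integer from `main` lets cron or a wrapper detect failures through the exit status, while the same function stays callable from tests with an explicit argument list.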
## Technology Stack
- **Framework**: Streamlit (chat interface)
- **Vector DB**: LanceDB (lightweight, local)
- **LLM Framework**: LangChain
- **File Processing**: PyPDF2, python-docx
- **OCR**: pytesseract
- **Email**: smtplib
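
Since the email step relies on the standard library, a report could be assembled with `email.message.EmailMessage` and sent with `smtplib.SMTP`. This is a sketch under assumptions: the function names, the subject wording, the use of STARTTLS on port 587, and the optional login are illustrative and depend on the mail server actually used.

```python
import smtplib
from email.message import EmailMessage
from typing import Optional


def build_report_email(
    sender: str, recipient: str, expiring: list[tuple[str, int]]
) -> EmailMessage:
    """Build a plain-text report from (contract name, days remaining) pairs."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Contract scan: {len(expiring)} expiring soon"
    lines = [f"- {name}: {days} days left" for name, days in expiring]
    body = "\n".join(lines) or "No contracts are expiring in the alert window."
    msg.set_content(body)
    return msg


def send_report(
    msg: EmailMessage,
    host: str,
    port: int = 587,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> None:
    """Send the message over SMTP with STARTTLS (server details are assumptions)."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        if username and password:
            server.login(username, password)
        server.send_message(msg)
```

Keeping message construction separate from sending means the report content can be tested without a mail server, and a dry run can print the message instead of sending it.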
## Data Flow
1. **Ingestion**: File → Validation → Processing → LanceDB
2. **Query**: User Input → RAG → Context Retrieval → Response
3. **Manual Scan**: Trigger → Contract Scan → Analysis → Email Report
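
The context-retrieval step in the query flow amounts to ranking stored chunk embeddings by similarity to the query embedding. In this design LanceDB would perform that search; the pure-Python stand-in below only illustrates the idea with cosine similarity. The function names and the in-memory `index` of `(chunk_id, vector)` pairs are assumptions for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: list[float],
    index: list[tuple[str, list[float]]],
    k: int = 3,
) -> list[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    scored = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]
```

A brute-force scan like this is fine for a small synthetic corpus; a vector database replaces it with an indexed search once the number of chunks grows.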