# CLM System - Low Level Design

## Minimal Folder Structure (Python + Streamlit)

```
clm-system/
├── app.py                      # Main Streamlit chat interface
├── requirements.txt            # Dependencies
├── config.py                   # Configuration settings
├── data/                       # Synthetic contract documents
│   ├── contracts/              # PDF, DOCX, TXT files
│   └── metadata/               # Document metadata
├── src/
│   ├── __init__.py
│   ├── ingestion.py            # Document processing & indexing
│   ├── rag.py                  # RAG pipeline
│   ├── agent.py                # Manual trigger agent
│   └── utils.py                # Helper functions
├── scripts/
│   ├── manual_scan.py          # Manual trigger script
│   └── generate_reports.py     # Report generation script
└── tests/                      # Basic tests
    └── test_ingestion.py
```

## Setup Instructions

Create the project with: `uv init clm-system`

## Core Components

### 1. Streamlit Interface (`app.py`)

- Chat interface for contract queries
- Document similarity search
- Upload new contracts
- Manual trigger button for daily scan

### 2. Document Ingestion (`src/ingestion.py`)

- File validation and type detection
- OCR for scanned PDFs
- Text extraction from PDF/DOCX/TXT
- LanceDB vector storage
- Basic chunking strategy

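The validation and chunking steps listed above could be sketched as follows. This is a minimal illustration using only the standard library. The names `SUPPORTED_TYPES`, `detect_file_type`, and `chunk_text`, as well as the default chunk size and overlap, are assumptions made for illustration and are not part of the design. OCR, text extraction, and LanceDB storage are left out.

```python
from pathlib import Path

# Hypothetical mapping of accepted extensions to a type label.
SUPPORTED_TYPES = {".pdf": "pdf", ".docx": "docx", ".txt": "txt"}


def detect_file_type(path: str) -> str:
    """Return the type label for a supported file, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return SUPPORTED_TYPES[suffix]


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("Require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip chunks that are only whitespace
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the current chunk already reached the end of the text
    return chunks
```

Character-based chunks with a small overlap are the simplest option. Splitting on clause or paragraph boundaries may suit contracts better, at the cost of more parsing logic.
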
### 3. RAG Pipeline (`src/rag.py`)

- LangChain retrieval
- Context-aware querying
- Source citation (document name, page)
- Embedding generation

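One way to keep answers traceable to a document name and page is to number the retrieved passages in the prompt and return the matching citation list. The sketch below assumes retrieval has already happened. `RetrievedChunk`, `build_prompt`, and `format_citations` are hypothetical names, and the prompt wording is only an example. No LangChain calls are shown.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """A retrieved passage plus the metadata needed to cite it (hypothetical shape)."""
    text: str
    document: str
    page: int


def build_prompt(question: str, chunks: list[RetrievedChunk]) -> str:
    """Assemble an LLM prompt whose context passages are numbered for citation."""
    context = "\n".join(
        f"[{i}] ({c.document}, p. {c.page}) {c.text}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below and cite sources as [n].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def format_citations(chunks: list[RetrievedChunk]) -> list[str]:
    """Return one 'document, page N' label per distinct source, in first-seen order."""
    seen: list[str] = []
    for c in chunks:
        label = f"{c.document}, page {c.page}"
        if label not in seen:
            seen.append(label)
    return seen
```

The citation list can be shown under the chat response so that a reader can open the cited page directly.
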
### 4. Manual Agent (`src/agent.py`)

- Manual trigger via script
- Expiration date detection (30-day alert)
- Conflict identification
- Email report generation

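The 30-day expiration alert and the email report could look roughly like the sketch below. It assumes expiration dates have already been extracted into a simple record. `Contract`, `find_expiring`, and `build_report` are hypothetical names, and the addresses in any call are placeholders. Contracts that have already expired are ignored here, which is a design assumption rather than a stated requirement. Conflict identification is not covered.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from email.message import EmailMessage


@dataclass
class Contract:
    """Minimal contract record (hypothetical; real metadata would carry more fields)."""
    name: str
    expires_on: date


def find_expiring(contracts: list[Contract], today: date, window_days: int = 30) -> list[Contract]:
    """Return contracts expiring between today and today + window_days, soonest first."""
    cutoff = today + timedelta(days=window_days)
    due = [c for c in contracts if today <= c.expires_on <= cutoff]
    return sorted(due, key=lambda c: c.expires_on)


def build_report(expiring: list[Contract], today: date, sender: str, recipient: str) -> EmailMessage:
    """Build (but do not send) a plain-text email summarizing expiring contracts."""
    msg = EmailMessage()
    msg["Subject"] = f"CLM scan: {len(expiring)} contract(s) expiring soon"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [
        f"- {c.name}: expires {c.expires_on.isoformat()} "
        f"({(c.expires_on - today).days} days left)"
        for c in expiring
    ]
    msg.set_content("\n".join(lines) or "No contracts expire within the alert window.")
    return msg
```

Sending the message would go through `smtplib`, for example `smtplib.SMTP(host).send_message(msg)`, with the host and credentials supplied by the deployment.
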
### 5. Manual Triggers

- `scripts/manual_scan.py`: runs the daily scan
- `scripts/generate_reports.py`: generates reports
- Both can be scheduled with cron or run manually

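A trigger script that works both from cron and from a terminal mostly needs argument parsing and a meaningful exit code. The following is a sketch of what `scripts/manual_scan.py` might contain. The flag names `--window-days` and `--dry-run` are assumptions, and the call into `src/agent.py` is replaced by a print statement.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse command-line options for a manual scan (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Run the CLM contract scan once.")
    parser.add_argument("--window-days", type=int, default=30,
                        help="Alert window for upcoming expirations.")
    parser.add_argument("--dry-run", action="store_true",
                        help="Print the report instead of emailing it.")
    return parser.parse_args(argv)


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Entry point; returns a process exit code so cron can detect failures."""
    args = parse_args(argv)
    mode = "dry run" if args.dry_run else "email"
    # A real script would call into src/agent.py here; this sketch only reports its settings.
    print(f"Scanning with a {args.window_days}-day window ({mode} mode)")
    return 0
```

As a script, the file would end with `raise SystemExit(main())` under an `if __name__ == "__main__":` guard. A crontab entry such as `0 7 * * * cd /path/to/clm-system && python scripts/manual_scan.py` would then run it daily at 07:00, where the path is a placeholder.
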
## Technology Stack

- **Framework**: Streamlit (chat interface)
- **Vector DB**: LanceDB (lightweight, local)
- **LLM Framework**: LangChain
- **File Processing**: PyPDF2, python-docx
- **OCR**: pytesseract
- **Email**: smtplib

## Data Flow

1. **Ingestion**: File → Validation → Processing → LanceDB
2. **Query**: User Input → RAG → Context Retrieval → Response
3. **Manual Scan**: Trigger → Contract Scan → Analysis → Email Report
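
To make the ingestion and query flows concrete, here is a toy end-to-end sketch that runs without any external services. It is not the planned implementation. `InMemoryIndex` is a hypothetical stand-in for LanceDB, and `embed` uses plain word counts in place of a real embedding model, so the ranking is only a rough approximation of semantic similarity.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Stand-in for the vector store; holds (document, text, vector) rows in a list."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, str, Counter]] = []

    def ingest(self, document: str, text: str) -> None:
        """Ingestion flow: validate the text, embed it, then store it."""
        if not text.strip():
            raise ValueError(f"{document} contains no text")
        self.rows.append((document, text, embed(text)))

    def query(self, question: str, k: int = 1) -> list[tuple[str, str]]:
        """Query flow: embed the question and return the k most similar rows."""
        q = embed(question)
        ranked = sorted(self.rows, key=lambda r: cosine(q, r[2]), reverse=True)
        return [(doc, text) for doc, text, _ in ranked[:k]]
```

The rows returned by `query` are the retrieved context that the RAG step would pass to the LLM to produce the response.
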