Initial implementation by kimi k2 0905
72
PLANNING/low_level_design.md
Normal file
@@ -0,0 +1,72 @@
# CLM System - Low Level Design

## Minimal Folder Structure (Python + Streamlit)

```
clm-system/
├── app.py                      # Main Streamlit chat interface
├── requirements.txt            # Dependencies
├── config.py                   # Configuration settings
├── data/                       # Synthetic contract documents
│   ├── contracts/              # PDF, DOCX, TXT files
│   └── metadata/               # Document metadata
├── src/
│   ├── __init__.py
│   ├── ingestion.py            # Document processing & indexing
│   ├── rag.py                  # RAG pipeline
│   ├── agent.py                # Manual trigger agent
│   └── utils.py                # Helper functions
├── scripts/
│   ├── manual_scan.py          # Manual trigger script
│   └── generate_reports.py     # Report generation script
└── tests/                      # Basic tests
    └── test_ingestion.py
```

## Setup Instructions
Create the project with: `uv init clm-system`

## Core Components

### 1. Streamlit Interface (app.py)
- Chat interface for contract queries
- Document similarity search
- Upload new contracts
- Manual trigger button for daily scan

### 2. Document Ingestion (src/ingestion.py)
- File validation and type detection
- OCR for scanned PDFs
- Text extraction from PDF/DOCX/TXT
- LanceDB vector storage
- Basic chunking strategy
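
The validation and chunking steps above could look roughly like the following. This is a minimal sketch, not the project's implementation: the function names `detect_file_type` and `chunk_text`, the extension whitelist, and the default `chunk_size`/`overlap` values are all assumptions made for illustration, and the chunker uses plain fixed-size character windows as one possible "basic" strategy.

```python
from pathlib import Path

# Assumption: only the three formats named in the folder structure are accepted.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def detect_file_type(path: str) -> str:
    """Return the lowercase extension without the dot, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return suffix.lstrip(".")


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("Need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a clause that straddles a chunk boundary visible in both neighbouring chunks, at the cost of storing some text twice.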

### 3. RAG Pipeline (src/rag.py)
- LangChain retrieval
- Context-aware querying
- Source citation (document name, page)
- Embedding generation
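
To make the retrieval and citation steps concrete, here is a library-free sketch. It deliberately does not use LangChain or LanceDB: a plain cosine-similarity ranking stands in for the vector search, and the `Chunk` dataclass, `retrieve`, and `build_prompt` are hypothetical names introduced only for this example. Embedding vectors are assumed to be computed elsewhere.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    document: str        # source file name, used in the citation
    page: int            # 1-based page number, used in the citation
    vector: list[float]  # embedding, assumed to be precomputed


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector: list[float], chunks: list[Chunk], k: int = 3) -> list[Chunk]:
    """Return the k chunks most similar to the query, best match first."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, c.vector), reverse=True)
    return ranked[:k]


def build_prompt(question: str, hits: list[Chunk]) -> str:
    """Number each retrieved chunk and attach its document name and page."""
    context = "\n".join(
        f"[{i}] ({c.document}, p. {c.page}) {c.text}" for i, c in enumerate(hits, 1)
    )
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Carrying `document` and `page` on every chunk from ingestion onward is what makes the citation possible at answer time.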

### 4. Manual Agent (src/agent.py)
- Manual trigger via script
- Expiration date detection (30-day alert)
- Conflict identification
- Email report generation
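
The 30-day expiration alert can be sketched as a simple date-window filter. The function name `find_expiring` and the input shape (a mapping from contract name to expiry date) are assumptions for illustration; how expiry dates are extracted from contract text is out of scope here. The sketch also assumes that contracts which have already expired are not reported by this alert.

```python
from datetime import date, timedelta

ALERT_WINDOW_DAYS = 30  # the 30-day alert window from the design


def find_expiring(
    contracts: dict[str, date],
    today: date,
    window_days: int = ALERT_WINDOW_DAYS,
) -> list[tuple[str, int]]:
    """Return (name, days_left) for contracts expiring within the window.

    Already-expired contracts are skipped (an assumption of this sketch);
    results are sorted so the most urgent contract comes first.
    """
    cutoff = today + timedelta(days=window_days)
    expiring = [
        (name, (expiry - today).days)
        for name, expiry in contracts.items()
        if today <= expiry <= cutoff
    ]
    return sorted(expiring, key=lambda item: item[1])
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check deterministic and easy to test.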

### 5. Manual Triggers
- `scripts/manual_scan.py`: Run daily scan
- `scripts/generate_reports.py`: Generate reports
- Both can be run via cron or manually
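
For a script to work both under cron and from a terminal, it helps to take all input from command-line flags and to report success through its exit code. The sketch below shows one possible shape for `scripts/manual_scan.py`; the `--days` flag and the `run_scan` placeholder are hypothetical and not part of the original design.

```python
import argparse
from typing import Optional


def run_scan(window_days: int) -> int:
    """Placeholder for the real scan; returns the number of alerts raised."""
    print(f"Scanning for contracts expiring within {window_days} days")
    return 0


def main(argv: Optional[list[str]] = None) -> int:
    """Parse flags, run one scan, and return a process exit code."""
    parser = argparse.ArgumentParser(description="Run the contract scan once.")
    parser.add_argument("--days", type=int, default=30, help="alert window in days")
    args = parser.parse_args(argv)
    run_scan(args.days)
    return 0  # zero tells cron (or a shell) that the run succeeded
```

In the script itself, calling `sys.exit(main())` under the usual `if __name__ == "__main__":` guard would hand that exit code back to cron or the shell.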

## Technology Stack
- **Framework**: Streamlit (chat interface)
- **Vector DB**: LanceDB (lightweight, local)
- **LLM Framework**: LangChain
- **File Processing**: PyPDF2, python-docx
- **OCR**: pytesseract
- **Email**: smtplib

## Data Flow
1. **Ingestion**: File → Validation → Processing → LanceDB
2. **Query**: User Input → RAG → Context Retrieval → Response
3. **Manual Scan**: Trigger → Contract Scan → Analysis → Email Report
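
The last hop of the manual scan flow, turning analysis results into an email report, might be built with the standard library's `email.message.EmailMessage`. This is a sketch under assumptions: `build_report_email`, the subject wording, and the `(name, days_left)` alert shape are invented for the example, and the message is only constructed here, not sent.

```python
from email.message import EmailMessage


def build_report_email(
    alerts: list[tuple[str, int]], sender: str, recipient: str
) -> EmailMessage:
    """Build a plain-text report listing each contract and its days left."""
    msg = EmailMessage()
    msg["Subject"] = f"CLM scan: {len(alerts)} contract(s) need attention"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"- {name}: expires in {days} day(s)" for name, days in alerts]
    # Fall back to a short notice so an empty scan still yields a readable mail.
    msg.set_content("\n".join(lines) or "No contracts need attention.")
    return msg
```

Sending would then be a matter of opening an `smtplib.SMTP` connection to whatever server the deployment uses and calling its `send_message(msg)` method; host, port, and credentials would presumably live in `config.py`.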