CLM System - Low Level Design
Minimal Folder Structure (Python + Streamlit)
clm-system/
├── app.py # Main Streamlit chat interface
├── requirements.txt # Dependencies
├── config.py # Configuration settings
├── data/ # Synthetic contract documents
│ ├── contracts/ # PDF, DOCX, TXT files
│ └── metadata/ # Document metadata
├── src/
│ ├── __init__.py
│ ├── ingestion.py # Document processing & indexing
│ ├── rag.py # RAG pipeline
│ ├── agent.py # Manual trigger agent
│ └── utils.py # Helper functions
├── scripts/
│ ├── manual_scan.py # Manual trigger script
│ └── generate_reports.py # Report generation script
└── tests/ # Basic tests
    └── test_ingestion.py
Setup Instructions
Create the project with: uv init clm-system
Core Components
1. Streamlit Interface (app.py)
- Chat interface for contract queries
- Document similarity search
- Upload new contracts
- Manual trigger button for daily scan
2. Document Ingestion (src/ingestion.py)
- File validation and type detection
- OCR for scanned PDFs
- Text extraction from PDF/DOCX/TXT
- LanceDB vector storage
- Basic chunking strategy
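The validation and chunking steps above could look roughly like the following. This is a minimal sketch, not the project's actual code: the names `detect_type` and `chunk_text`, the suffix whitelist, and the default chunk size and overlap are all assumptions chosen for illustration. Text extraction, OCR, and the LanceDB write are left out.

```python
from pathlib import Path

# Assumed whitelist, based on the file types listed in this design.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt"}


def detect_type(path: str) -> str:
    """Return the normalized file type, or raise ValueError if unsupported."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return suffix.lstrip(".")


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text,
        # so no chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

Character windows are the simplest possible strategy; splitting on clause or paragraph boundaries would likely suit contracts better, but that is outside the "basic" scope named here.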
3. RAG Pipeline (src/rag.py)
- LangChain retrieval
- Context-aware querying
- Source citation (document name, page)
- Embedding generation
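To illustrate the source citation step, here is a small sketch of how retrieved chunks might be turned into a numbered context block and a citation list. It deliberately avoids LangChain calls and uses only the standard library; `RetrievedChunk`, `build_prompt`, and `format_citations` are hypothetical names, and the prompt wording is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    # Hypothetical record shape; the design only states that
    # citations carry the document name and page.
    document: str
    page: int
    text: str


def build_prompt(question: str, chunks: list[RetrievedChunk]) -> str:
    """Number each chunk so the answer can cite sources as [1], [2], ..."""
    context_lines = [
        f"[{i}] ({c.document}, p. {c.page}) {c.text}"
        for i, c in enumerate(chunks, start=1)
    ]
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        "Context:\n" + "\n".join(context_lines) + f"\n\nQuestion: {question}"
    )


def format_citations(chunks: list[RetrievedChunk]) -> list[str]:
    """Return unique (document, page) sources in retrieval order."""
    unique = dict.fromkeys((c.document, c.page) for c in chunks)
    return [f"{doc}, page {page}" for doc, page in unique]
```

Keeping the page number on every chunk at ingestion time is what makes this step cheap; it cannot be recovered reliably after chunking.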
4. Manual Agent (src/agent.py)
- Manual trigger via script
- Expiration date detection (alert when a contract expires within 30 days)
- Conflict identification
- Email report generation
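The 30-day expiration check can be sketched with plain date arithmetic. This assumes expiry dates have already been extracted into a name-to-date mapping; the function name `expiring_contracts` and the choice to skip contracts that have already expired are assumptions, not part of the design.

```python
from datetime import date, timedelta

ALERT_WINDOW_DAYS = 30  # the alert window named in the design


def expiring_contracts(
    contracts: dict[str, date],
    today: date,
    window_days: int = ALERT_WINDOW_DAYS,
) -> list[tuple[str, int]]:
    """Return (name, days_left) for contracts expiring within the window, soonest first."""
    cutoff = today + timedelta(days=window_days)
    hits = [
        (name, (expiry - today).days)
        for name, expiry in contracts.items()
        # Assumption: already-expired contracts are handled elsewhere.
        if today <= expiry <= cutoff
    ]
    return sorted(hits, key=lambda item: item[1])
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check easy to test.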
5. Manual Triggers
- scripts/manual_scan.py: Run daily scan
- scripts/generate_reports.py: Generate reports
- Both can be run via cron or manually
Technology Stack
- Framework: Streamlit (chat interface)
- Vector DB: LanceDB (lightweight, local)
- LLM Framework: LangChain
- File Processing: PyPDF2, python-docx
- OCR: pytesseract
- Email: smtplib
Data Flow
- Ingestion: File → Validation → Processing → LanceDB
- Query: User Input → RAG → Context Retrieval → Response
- Manual Scan: Trigger → Contract Scan → Analysis → Email Report
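The last step of the Manual Scan flow, the email report, might be assembled as below. This is a sketch using the standard library's `email.message.EmailMessage` and `smtplib`; the function names, subject format, and the `localhost:25` defaults are placeholders, and a real deployment would likely need TLS and authentication.

```python
import smtplib
from email.message import EmailMessage


def build_report(findings: list[str], sender: str, recipient: str) -> EmailMessage:
    """Build a plain-text report with one bullet per finding."""
    msg = EmailMessage()
    msg["Subject"] = f"CLM scan report: {len(findings)} finding(s)"
    msg["From"] = sender
    msg["To"] = recipient
    body = "\n".join(f"- {finding}" for finding in findings) or "No findings."
    msg.set_content(body)
    return msg


def send_report(msg: EmailMessage, host: str = "localhost", port: int = 25) -> None:
    """Send the report; host and port are placeholder defaults."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

Separating message construction from sending lets the scan scripts print or log the report in a dry run without touching an SMTP server.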