CLM System - Low Level Design

Minimal Folder Structure (Python + Streamlit)

clm-system/
├── app.py                      # Main Streamlit chat interface
├── requirements.txt            # Dependencies
├── config.py                   # Configuration settings
├── data/                       # Synthetic contract documents
│   ├── contracts/              # PDF, DOCX, TXT files
│   └── metadata/               # Document metadata
├── src/
│   ├── __init__.py
│   ├── ingestion.py            # Document processing & indexing
│   ├── rag.py                  # RAG pipeline
│   ├── agent.py                # Manual trigger agent
│   └── utils.py                # Helper functions
├── scripts/
│   ├── manual_scan.py          # Manual trigger script
│   └── generate_reports.py     # Report generation script
└── tests/                      # Basic tests
    └── test_ingestion.py

Setup Instructions

Create the module with: uv init clm-system --module

Core Components

1. Streamlit Interface (app.py)

  • Chat interface for contract queries
  • Document similarity search
  • Upload new contracts
  • Manual trigger button for daily scan

2. Document Ingestion (src/ingestion.py)

  • File validation and type detection
  • OCR for scanned PDFs
  • Text extraction from PDF/DOCX/TXT
  • LanceDB vector storage
  • Basic chunking strategy
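
The validation and chunking steps above could look roughly like the following. This is a minimal sketch using only the standard library: the names detect_type and chunk_text, the character-based windows, and the default sizes (500 characters, 50 overlap) are assumptions for illustration, not part of the design. OCR, text extraction, and the LanceDB write are left out.

```python
from pathlib import Path

# Assumption: only the three formats named in the design are accepted.
SUPPORTED_TYPES = {".pdf": "pdf", ".docx": "docx", ".txt": "txt"}


def detect_type(path: str) -> str:
    """Return a short type label for a supported file, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return SUPPORTED_TYPES[suffix]


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows sharing `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("Require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a clause that straddles a chunk boundary visible in both neighbouring chunks, which tends to help retrieval on contract text. Splitting on sentences or sections is a possible refinement.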

3. RAG Pipeline (src/rag.py)

  • LangChain retrieval
  • Context-aware querying
  • Source citation (document name, page)
  • Embedding generation
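
To illustrate how retrieval and source citation fit together, here is a small standard-library sketch. It does not use LangChain or a real embedding model: word counts with cosine similarity stand in for embeddings, and the names Chunk, retrieve, and build_prompt are hypothetical. In the actual pipeline the retriever would presumably be backed by LanceDB.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    document: str  # source file name, used in the citation
    page: int      # page number within the source document


def _vector(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[Chunk], k: int = 2) -> list[Chunk]:
    """Return the k chunks most similar to the query, best first."""
    q = _vector(query)
    ranked = sorted(chunks, key=lambda c: _cosine(q, _vector(c.text)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, hits: list[Chunk]) -> str:
    """Assemble a prompt whose context lines carry [document, p. N] citations."""
    context = "\n".join(f"[{c.document}, p. {c.page}] {c.text}" for c in hits)
    return (
        "Answer using only the context below and cite sources.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
```

Keeping the document name and page on each chunk from ingestion onward is what makes the citation possible at answer time.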

4. Manual Agent (src/agent.py)

  • Manual trigger via script
  • Expiration date detection (alert when a contract expires within 30 days)
  • Conflict identification
  • Email report generation
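
The expiration check could be sketched as below, assuming the expiration date has already been extracted from each contract. The Contract class and the expiring_soon function are hypothetical names, and treating the window as inclusive while skipping already-expired contracts is an assumption the design does not settle.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Contract:
    name: str
    expires_on: date


def expiring_soon(contracts, today, window_days=30):
    """Return contracts expiring between today and today + window_days, soonest first.

    Both ends of the window are inclusive; contracts that expired before
    `today` are not returned (assumption).
    """
    cutoff = today + timedelta(days=window_days)
    due = [c for c in contracts if today <= c.expires_on <= cutoff]
    return sorted(due, key=lambda c: c.expires_on)
```

Passing today in as a parameter, rather than calling date.today() inside the function, keeps the check easy to test.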

5. Manual Triggers

  • scripts/manual_scan.py: Run daily scan
  • scripts/generate_reports.py: Generate reports
  • Both scripts can be run by hand or scheduled with cron
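
A command-line entry point for scripts/manual_scan.py might be sketched with argparse as follows. The flags --contracts-dir, --window-days, and --dry-run are assumptions chosen for illustration, and the scan itself is a placeholder print rather than a call into src/agent.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser; all flags here are hypothetical."""
    parser = argparse.ArgumentParser(description="Run the contract scan on demand.")
    parser.add_argument("--contracts-dir", default="data/contracts",
                        help="Folder containing the contract files")
    parser.add_argument("--window-days", type=int, default=30,
                        help="Alert window for upcoming expirations, in days")
    parser.add_argument("--dry-run", action="store_true",
                        help="Scan and print findings without sending email")
    return parser


def main(argv=None) -> int:
    """Parse arguments and run the scan; returns a process exit code."""
    args = build_parser().parse_args(argv)
    # Placeholder: the real script would call the agent in src/agent.py here.
    print(f"Scanning {args.contracts_dir} for contracts expiring "
          f"within {args.window_days} days")
    if args.dry_run:
        print("Dry run: no email sent")
    return 0
```

In the script file, main() would be invoked under the usual if __name__ == "__main__" guard. A cron entry such as `0 6 * * * cd /path/to/clm-system && python scripts/manual_scan.py` would then run the same script every day at 06:00; the path and schedule are placeholders.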

Technology Stack

  • Framework: Streamlit (chat interface)
  • Vector DB: LanceDB (lightweight, local)
  • LLM Framework: LangChain
  • File Processing: PyPDF2, python-docx
  • OCR: pytesseract
  • Email: smtplib

Data Flow

  1. Ingestion: File → Validation → Processing → LanceDB
  2. Query: User Input → RAG → Context Retrieval → Response
  3. Manual Scan: Trigger → Contract Scan → Analysis → Email Report
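
The last step of the manual scan flow, turning findings into an email report, could be sketched with the standard library as below. The function names, the subject line format, and the use of STARTTLS on port 587 are assumptions; the design only states that smtplib is used.

```python
import smtplib
from email.message import EmailMessage


def build_report_email(sender, recipient, findings):
    """Build (but do not send) a plain-text report from a list of finding strings."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"CLM scan report: {len(findings)} finding(s)"
    body = "\n".join(f"- {line}" for line in findings) or "No findings."
    msg.set_content(body)
    return msg


def send_report(msg, host, port=587, username=None, password=None):
    """Send the message over SMTP with STARTTLS (assumed server setup)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        if username and password:
            smtp.login(username, password)
        smtp.send_message(msg)
```

Separating message construction from sending lets the report content be checked in tests without an SMTP server.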