clm-system/PLANNING/Task.md
**Time Allocation:** Approximately 2 hours total
**AI Tools Encouraged:** Use any generative AI tools (ChatGPT, Claude, Copilot, etc.) to accelerate development
**Deliverables:** Automation pipelines, API integrations, testing frameworks, and deployment configurations
**Contract Lifecycle Management (CLM) Automation**

The company is streamlining its Contract Lifecycle Management (CLM) process. Currently, contracts are stored in a disorganized manner across various departments. The goal is to create an intelligent platform that can:
- Index: Automatically ingest contracts from different sources.
- Understand: Extract key information (dates, parties, clauses).
- Alert: Identify potential issues (conflicts in contact info, approaching expiration dates).
- Provide Access: Make contract information easily accessible to authorized users via a chatbot and daily reports.
- Enable Insights: Detect similar contract versions (for version control and review).
The candidate should generate a synthetic dataset of 10-15 documents of varying types.  Include these document formats:
- PDFs (4-5):  Standard contracts, scanned contracts (requiring OCR)
- Word Documents (.docx) (3-4):  Draft contracts, amendments
- Text Files (.txt) (2-3):  Contract summaries, email correspondence related to contracts.
- Unstructured Text (2): e.g., meeting notes regarding a contract. These should be deliberately less structured to test the candidate's ability to handle complexity.
Within the documents, there should be:
- Variations:  Several versions of the same contract with minor changes.
- Conflicts:  Deliberately include conflicting information (e.g., different addresses for the same company, different expiration dates) across different documents.
- Key Dates: Include contract creation dates, renewal dates, termination dates, and potentially clauses with specific effective dates.
- Metadata:  Some documents should have existing metadata (e.g., contract name, department) to test how the candidate integrates metadata into the pipeline.  Others should not. 
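As a hedged illustration, the deliberate variations and conflicts described above could be seeded with a small generator script. The folder name, field layout, and company names below are assumptions for illustration, not part of the brief:

```python
from datetime import date, timedelta
from pathlib import Path

OUT_DIR = Path("synthetic_contracts")  # hypothetical output folder

def render_contract(name: str, party: str, address: str, expires: date) -> str:
    """Render one plain-text contract with the fields the pipeline must extract."""
    return (
        f"CONTRACT: {name}\n"
        f"PARTY: {party}\n"
        f"ADDRESS: {address}\n"
        f"EXPIRATION DATE: {expires.isoformat()}\n"
    )

def build_dataset(out_dir: Path = OUT_DIR) -> list[Path]:
    """Write a tiny synthetic corpus containing a seeded conflict and a near-term expiry."""
    out_dir.mkdir(exist_ok=True)
    soon = date.today() + timedelta(days=20)   # inside the 30-day alert window
    docs = {
        # Two versions of the same contract with a deliberate address conflict.
        "acme_v1.txt": render_contract("Acme Supply Agreement", "Acme Corp",
                                       "12 Main St", soon),
        "acme_v2.txt": render_contract("Acme Supply Agreement", "Acme Corp",
                                       "98 Oak Ave", soon),
        # An unrelated contract that should not trigger any alert.
        "globex_nda.txt": render_contract("Globex NDA", "Globex Inc",
                                          "5 Elm Rd",
                                          date.today() + timedelta(days=400)),
    }
    for fname, text in docs.items():
        (out_dir / fname).write_text(text, encoding="utf-8")
    return sorted(out_dir.iterdir())
```

Extending the same idea to .docx drafts and scanned PDFs would cover the remaining formats.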
The candidate should build a system that can:
- **Document Ingestion & Indexing:**
  - Load documents from a designated folder (simulating an incoming source).
  - Use a suitable vector database (e.g., ChromaDB, Pinecone) to store embeddings. Justify the database choice in a brief comment/README.
  - Implement a basic chunking strategy.
- **RAG Pipeline:**
  - Create a RAG pipeline using LangChain or a similar framework. The pipeline should retrieve relevant document chunks based on user queries.
- **AI Agent (Daily Report Generation):**
  - Develop an AI agent using LangChain Agents (or similar) that runs daily.
  - The agent should automatically:
    - Identify approaching contract expiration dates (within the next 30 days).
    - Detect conflicting information (e.g., different addresses for the same company in different contracts). The agent must describe what the conflict is and where it occurs (document names).
    - Summarize the findings in an email report to a predefined email address (provide a test email address). The email should be formatted clearly and concisely.
- **Chatbot Interface:**
  - Create a simple chatbot interface (e.g., using Streamlit, Gradio, or a basic Flask app).
  - When a user asks a question about a contract, the chatbot should:
    - Use the RAG pipeline to retrieve relevant document chunks.
    - Provide the AI-generated answer to the user.
    - Clearly cite the source documents used to generate the answer (e.g., document name and page number). This is crucial.
- **Document Similarity:**
  - Implement a function to find similar documents based on semantic similarity (using embedding similarity). The user should be able to input a document name and receive a list of similar documents.
- **Error Handling & Logging:** Implement basic error handling and logging to ensure the system's reliability.
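To make the chunking requirement concrete, here is a minimal sketch of a fixed-size, overlapping character chunker. The sizes are illustrative assumptions; a production version might instead split on clause or paragraph boundaries:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split `text` into overlapping character windows.

    The overlap keeps clauses that straddle a boundary from being cut
    cleanly in two, which helps retrieval land on complete sentences.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```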
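The two daily checks the agent performs can be reduced to plain functions before any agent framework is involved. This sketch assumes extraction has already produced simple records; the field names (`expires`, `company`, `address`, `doc`) are hypothetical:

```python
from collections import defaultdict
from datetime import date, timedelta

def approaching_expirations(contracts, today=None, window_days=30):
    """Return contracts whose expiration date falls within the next `window_days`."""
    today = today or date.today()
    cutoff = today + timedelta(days=window_days)
    return [c for c in contracts if today <= c["expires"] <= cutoff]

def detect_address_conflicts(records):
    """Flag companies whose address differs across documents, reporting
    the conflicting values and the documents they came from."""
    by_company = defaultdict(list)            # company -> [(address, doc)]
    for r in records:
        by_company[r["company"]].append((r["address"], r["doc"]))
    conflicts = []
    for company, pairs in by_company.items():
        addresses = sorted({a for a, _ in pairs})
        if len(addresses) > 1:
            docs = sorted({d for _, d in pairs})
            conflicts.append(
                f"Conflicting address for {company}: {addresses} (see {docs})"
            )
    return conflicts
```

The agent would run both checks on the extracted records and feed the results into the email report.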
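For the document-similarity requirement, the core operation is cosine similarity over stored embeddings. A dependency-free sketch follows; the dict-of-vectors shape is an assumption, since in practice the vector database would supply nearest-neighbour search directly:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def similar_documents(name, embeddings, top_k=3):
    """Rank all other documents by embedding similarity to `name`.

    `embeddings` maps document name -> embedding vector.
    """
    query = embeddings[name]
    scores = [(other, cosine(query, vec))
              for other, vec in embeddings.items() if other != name]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]
```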
**MCP Server Integration (Bonus):** If the candidate has time, ask them to describe how they would integrate the RAG pipeline with an existing MCP server (e.g., via REST APIs). Full implementation is not required; demonstrating an understanding of the process is sufficient.
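For the bonus, the integration surface can be sketched without committing to a framework: a handler that accepts a JSON query, delegates to the RAG pipeline, and returns the answer with its sources. The payload shape and the injected `rag_answer` callable are assumptions for illustration:

```python
import json

def handle_query(request_body: str, rag_answer) -> tuple[int, str]:
    """Parse a JSON request, call the RAG pipeline, return (status, JSON body).

    `rag_answer(question)` is expected to return (answer_text, source_list);
    an MCP server or REST route would wrap this in its own transport layer.
    """
    try:
        payload = json.loads(request_body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body must be valid JSON"})
    question = payload.get("question", "").strip()
    if not question:
        return 400, json.dumps({"error": "missing 'question' field"})
    answer, sources = rag_answer(question)
    return 200, json.dumps({"answer": answer, "sources": sources})
```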
**Success Criteria & Evaluation**
- **Functionality:**
  - Document Ingestion & Indexing: Does the system load and index the documents correctly? Are embeddings generated? (10%)
  - RAG Pipeline: Does the RAG pipeline retrieve relevant information based on user queries? (15%)
  - AI Agent: Does the agent run daily and generate accurate reports with detected conflicts and approaching expiration dates? (15%)
  - Chatbot Interface: Does the chatbot provide answers and cite sources correctly? (10%)
- **Code Quality & Design:**
  - Readability: Is the code well-formatted and easy to understand?
  - Modularity: Is the code organized into logical modules?
  - Documentation: Is the code adequately documented?
  - Error Handling: Does the code handle errors gracefully?
- **Reasoning & Approach:**
  - Framework Choices: Were appropriate frameworks and tools selected? Justification of choices is important.
  - Problem Solving: Did the candidate demonstrate a logical approach to solving the problem?
  - Scalability: Did the candidate consider the scalability of the solution (e.g., vector database choice, chunking strategy)?