Meeting Summary for CS 486-686 Lecture/Lab – Spring 2025
Date: Feb 18, 2025
Time: 08:08 AM Pacific Time (US and Canada)
Meeting ID: 893 0161 6954
Overview
The meeting provided an in-depth discussion on various AI and machine learning topics, including updates to GPT-4, the introduction of GROK-3, and concepts of true randomness in AI models. The discussion also focused on upcoming projects, notably the development of a new retrieval system for large source code repositories using tools like Llama Index and Retrieval-Augmented Generation (RAG) systems.
Quick Recap
During the meeting, several key topics were addressed:
- AI Model Updates:
- Discussion on improvements in GPT-4 and the introduction of GROK-3.
- Exploration of a distinct AI model technology that differs from conventional transformers.
- Examination of true randomness in AI systems, with comparisons drawn to quantum-level randomness.
- New Project Proposal:
- The team is set to develop a retrieval system for large source code repositories.
- Tools such as Llama Index and RAG systems will be experimented with.
- Diverse strategies for indexing and contextualizing source code were discussed.
- Potential Applications:
- AI models will be explored for training and assisting with coding tasks.
- The discussion covered enhancing code search and contextualization, including integrating natural language descriptions with code.
Next Steps
Action Items:
- Steven:
- Lead a Llama Index workshop session during the lab section tomorrow.
- Greg:
- Write the project specification for the semantic code search project.
- Share articles about code editors that use language models for captioning or describing code.
- Change the exposed OpenAI API key.
- Team:
- Explore and evaluate various strategies for indexing and searching large codebases using Llama Index.
- Develop evaluation methods for the semantic code search tool, including:
- Comparing outputs with other tools like Roo Code and Aider.
- Creating synthetic questions and answers for testing.
- Using language models as automated judges in assessments.
- Investigate augmenting code context with natural language descriptions, particularly for code lacking comments.
Summary of Discussion Topics
1. Exploring AI Models and True Randomness
AI Model Updates:
The discussion reviewed updates to GPT-4 and the introduction of GROK-3. The team also considered a novel AI model technology that is distinct from traditional transformers.
True Randomness:
The concept of true randomness in AI models was examined, with analogies drawn to the randomness observed at the quantum level, opening up avenues for further research.
2. RAG System Development and Source Code
Greg outlined the next project focusing on building a retrieval system for large source code repositories. Key points included:
- Tools & Techniques:
- Using Llama Index and RAG systems to build an efficient retrieval system.
- Discussing a potential future shift from RAG to fine-tuning techniques.
- Current Practices:
- Analyzing how Roo Code currently searches for related code using a fast grep implementation instead of embeddings (see the sketch after the diagram below).
- Looking into augmenting coding assistants with a custom-built indexer.
Mermaid Diagram: RAG System & Source Code Retrieval Architecture
flowchart TD
A[Source Code Repositories] --> B[Llama Index]
B --> C[Semantic Code Search Project]
C --> D["Retrieval-Augmented Generation (RAG)"]
D --> E[AI-Assisted Coding / Roo Code Integration]
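As a point of comparison with embedding-based retrieval, the following is a minimal sketch of the grep-style search described above, invoking ripgrep (rg) from Python via the subprocess module. The repository path, symbol, and output handling are illustrative assumptions, not Roo Code's actual implementation.
Python Sketch: Grep-Style Related-Code Search
import json
import subprocess

def grep_related_code(symbol: str, repo_path: str, max_hits: int = 20):
    """Find lines mentioning `symbol` in a repository using ripgrep (rg)."""
    # rg --json emits one JSON object per output line; "match" records hold the hits.
    proc = subprocess.run(
        ["rg", "--json", symbol, repo_path],
        capture_output=True,
        text=True,
    )
    hits = []
    for line in proc.stdout.splitlines():
        record = json.loads(line)
        if record.get("type") == "match":
            data = record["data"]
            hits.append({
                "file": data["path"]["text"],
                "line": data["line_number"],
                "text": data["lines"]["text"].rstrip(),
            })
            if len(hits) >= max_hits:
                break
    return hits

# Hypothetical usage: find code related to file descriptor allocation in xv6.
# print(grep_related_code("fdalloc", "./xv6-riscv"))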
3. Vector Database and Document Retrieval
- Current Implementation:
- The team reviewed a simple vector database built using a bag-of-words approach.
- Issues discussed included the negative effects of stop words and varying sentence lengths on semantic accuracy.
- Potential Improvements:
- Removing stop words or adjusting the vocabulary (a minimal bag-of-words sketch appears at the end of this section).
- Evaluating a document retrieval system on a PDF describing the xv6 kernel implementation by querying the term “file descriptor” showed that relevant information was retrieved, although assessing overall quality was difficult given the complexity of the vocabulary.
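A minimal sketch of the kind of bag-of-words vector store discussed above, including stop-word removal and cosine scoring, is shown below. The stop-word list and sample documents are illustrative placeholders rather than the class implementation; normalizing by vector length (as cosine similarity does) is one way to reduce the sentence-length effect noted above.
Python Sketch: Bag-of-Words Retrieval with Stop-Word Removal
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}  # illustrative subset

def bag_of_words(text: str) -> Counter:
    """Tokenize on whitespace, lowercase, strip punctuation, and drop stop words."""
    tokens = [t.strip(".,()").lower() for t in text.split()]
    return Counter(t for t in tokens if t and t not in STOP_WORDS)

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical documents standing in for chunks of the xv6 PDF.
docs = [
    "Each process has a table of open file descriptors.",
    "The scheduler switches between runnable processes.",
]
query = bag_of_words("file descriptor table")
ranked = sorted(docs, key=lambda d: cosine_similarity(query, bag_of_words(d)), reverse=True)
print(ranked[0])  # the file-descriptor chunk should rank first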
4. PDF to Markdown Conversion Discussion
- Challenges with PDF Extraction:
- Converting PDFs by chunking text into 512-byte segments resulted in a loss of context such as page numbers and section headings.
- Solution Introduced:
- Marker, an open-source tool available on GitHub, was presented as capable of converting PDFs to Markdown while effectively retaining images and code snippets.
- Markdown output can be more easily processed by language models and integrated into tools like Roo Code (a chunking sketch follows this list).
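To illustrate why structured Markdown output helps downstream, here is a minimal sketch contrasting naive fixed-size 512-byte chunking with heading-aware splitting of Markdown text. The sample text and the splitter are deliberately simplistic assumptions, not the behavior of Marker itself.
Python Sketch: Fixed-Size vs. Heading-Aware Chunking
import re

def fixed_size_chunks(text: str, size: int = 512):
    """Naive chunking: cut every `size` bytes, regardless of document structure."""
    data = text.encode("utf-8")
    return [data[i:i + size].decode("utf-8", errors="ignore")
            for i in range(0, len(data), size)]

def heading_chunks(markdown: str):
    """Split Markdown on headings so each chunk keeps its section title as context."""
    parts = re.split(r"(?m)^(#{1,6} .*)$", markdown)
    chunks, current = [], ""
    for part in parts:
        if re.match(r"^#{1,6} ", part):
            if current.strip():
                chunks.append(current.strip())
            current = part + "\n"
        else:
            current += part
    if current.strip():
        chunks.append(current.strip())
    return chunks

sample = "# File Descriptors\nEach process keeps a table...\n\n# Scheduling\nThe scheduler..."
print(heading_chunks(sample))  # each chunk starts with its section heading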
5. Exploring Llama Index for Coding Assistance
- Application of Llama Index:
- The potential of Llama Index to create high-quality semantic chunks from extensive code bases was discussed.
- The team plans to explore different strategies by tuning embedding models and vector stores (a configuration sketch follows this list).
- An upcoming demonstration of Llama Index is planned to spark a broader conversation on RAG applications for source code.
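A sketch of how those experiments might be wired up with a recent version of Llama Index is shown below. The embedding model name, the global Settings object, and the package layout are assumptions about the library's current API (which has shifted between releases), so treat this as illustrative rather than definitive.
Python Sketch: Swapping the Embedding Model in Llama Index
# Assumes: pip install llama-index llama-index-embeddings-huggingface
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Swap in an alternative embedding model (the model name is a hypothetical choice).
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Index a (hypothetical) source tree and run a quick retrieval check.
documents = SimpleDirectoryReader("./xv6-riscv", recursive=True).load_data()
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=5)
for result in retriever.retrieve("Where are file descriptors allocated?"):
    print(f"{result.score:.3f}  {result.node.metadata.get('file_path')}")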
6. AI Models and Custom Servers
- Model Use Cases:
- Discussion on using AI models in training, with a nod to a model named S1, trained on accounting data and claimed to match the coding ability of DeepSeek R1.
- Custom Integration:
- The idea of creating custom servers to provide enhanced contextual information to AI models was introduced.
- A demonstration of using Llama Index to index a large codebase (specifically the xv6 source code) was also presented.
7. Llama Index Demonstration: Document Querying and Retrieval
- Demonstration Highlights:
- Greg demonstrated the rapid loading and querying of documents using Llama Index.
- Creation of an in-memory vector index and query engine was shown.
- The tool’s capability to handle multiple content types and deliver detailed responses was emphasized.
- The demonstration included how the system identifies and scores relevant documents during a query.
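A minimal sketch of the workflow demonstrated (load documents, build an in-memory vector index, query it, and inspect the scored sources) is shown below. The directory path and query are placeholders rather than the exact code from the demo, and the default setup assumes an embedding/LLM backend (for example, an OpenAI API key) is already configured.
Python Sketch: In-Memory Vector Index and Query Engine
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load documents from a (hypothetical) directory and build an in-memory vector index.
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query the index and inspect which source chunks were retrieved, with their scores.
query_engine = index.as_query_engine()
response = query_engine.query("How does the kernel allocate a file descriptor?")
print(response)
for source in response.source_nodes:
    print(f"{source.score:.3f}  {source.node.metadata.get('file_name')}")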
8. Exploring LLMs for Code Description
- Code Base with Minimal Comments:
- The potential of employing large language models (LLMs) to describe code snippets in projects with minimal comments was explored.
- Technical Considerations:
- Issues related to the chunking process in directory readers and document segmentation were discussed.
- The role of Tree-sitter, a universal parsing tool used in various code editors and analysis tools (e.g., Aider and ripgrep), was noted (a parsing sketch follows this list).
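As an illustration of how Tree-sitter can drive code-aware chunking, the following minimal sketch parses a Python source string and collects top-level function definitions as candidate chunks. It assumes the tree-sitter and tree-sitter-python packages; constructor details differ slightly between library versions, so this is a sketch rather than a definitive recipe.
Python Sketch: Function-Level Chunking with Tree-sitter
# Assumes: pip install tree-sitter tree-sitter-python (API shown follows the 0.22+ style)
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

def function_chunks(source: bytes):
    """Return each top-level function definition as a (name, text) chunk."""
    tree = parser.parse(source)
    chunks = []
    for node in tree.root_node.children:
        if node.type == "function_definition":
            name_node = node.child_by_field_name("name")
            chunks.append((name_node.text.decode(), node.text.decode()))
    return chunks

sample = b"def open_file(path):\n    return open(path)\n\ndef close_file(f):\n    f.close()\n"
for name, text in function_chunks(sample):
    print(name, "->", len(text), "characters")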
9. Multilayer Code Search and Integration
- Development Strategies:
- A sophisticated multilayer approach for code search was discussed, integrating tools such as ripgrep and Tree-sitter (for parsing code into abstract syntax trees).
- Evaluation Methods:
- The team planned to evaluate the effectiveness of their tool by:
- Using synthetic questions and human revision.
- Leveraging large language models as automated judges (a minimal evaluation sketch appears after the diagram below).
- Integration Focus:
- Prospects for integrating the tool with Roo Code's code generation and applying the Model Context Protocol (MCP) for deeper system integration were considered.
- Upcoming sessions include a workshop with Steven and continued work on a separate embedding model.
Mermaid Diagram: Multilayer Code Search Process
flowchart TD
A[Raw Code Base] --> B[ripgrep Searches]
B --> C[Tree Sitter Parsing into ASTs]
C --> D[Semantic Indexing via Llama Index]
D --> E[Multilayer Code Search]
E --> F["Evaluation (Synthetic Q&A, LLM Judges)"]
F --> G[Integration with Roo Code]
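A minimal sketch of the synthetic-Q&A-plus-LLM-judge evaluation idea is shown below, using the OpenAI Python client as one possible backend. The model name, prompt wording, and 1-5 scoring scale are assumptions for illustration, not a specification from the meeting.
Python Sketch: LLM-as-Judge Evaluation
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a code-search tool.
Question: {question}
Reference answer: {reference}
Tool answer: {candidate}
Reply with a single integer from 1 (wrong) to 5 (fully correct)."""

def judge(question: str, reference: str, candidate: str, model: str = "gpt-4o-mini") -> int:
    """Ask an LLM judge to score a tool answer against a reference answer."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return int(completion.choices[0].message.content.strip())

# Hypothetical synthetic test case; the xv6 details are illustrative.
score = judge(
    question="Which function allocates a file descriptor in xv6?",
    reference="fdalloc() in sysfile.c scans the process's open-file table for a free slot.",
    candidate="The fdalloc function in sysfile.c.",
)
print(score)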
Conclusion
The meeting was productive and comprehensive, covering updates to AI models, new directions for code retrieval, and innovative approaches for document processing. The team established a clear series of follow-up tasks and actionable items aimed at improving their semantic code search tool and further integrating AI models into the coding process.
By leveraging tools like Llama Index and exploring multimodal integration strategies, the team is poised to enhance their approach to large-scale code retrieval and semantic evaluation in future projects.