
Meeting Summary: CS 486-686 Lecture/Lab Spring 2025

Date: February 20, 2025
Time: 08:23 AM Pacific Time (US and Canada)
Meeting ID: 893 0161 6954


Quick Recap

The meeting focused on enhancing the team’s code indexing and analysis capabilities. Key discussions included:

  • Embeddings: Evaluating options such as using Steven's account, running local embedding models, OpenAI embedding APIs, and even a potential Google embedding API.
  • Automated Code Development: Investigating the potential and safety of automated development and refactoring using AI coding assistants.
  • CodeSplitter Enhancements: Brainstorming improvements such as incorporating metadata (source file name, path, class, and method) into the CodeSplitter tool.
  • Evaluation Methods: Proposing strategies to assess the effectiveness of the source code RAG (Retrieval Augmented Generation) system.
  • Additional Tools: Exploring frameworks like LlamaIndex for building agents on top of the source code RAG and investigating tree-sitter for parsing source code.

Next Steps

  • Experimentation:
    • All team members will run the CodeSplitter on actual code to better understand its functionality and output.
  • Reading Assignments:
    • All team members are to read the article about code splitting referenced by Marcus.
    • Greg will send out additional reading materials on code splitting and related topics.
  • Project Specifications:
    • Greg will finalize and distribute the project specifications for the source code RAG system.
  • Tool Exploration:
    • All team members will explore using LlamaIndex as a framework for building agents on top of the source code RAG.
    • All team members will investigate tree-sitter and its capabilities in parsing source code.
  • Evaluation Research:
    • Greg is tasked with researching and proposing evaluation methods for the source code RAG system.
    • All team members should consider fallback strategies for handling unparsable code within the RAG system.

Detailed Meeting Discussion

1. Embeddings Discussion and Potential Options

  • Overview:
    Greg discussed the use of embeddings for their project.
  • Options Considered:
    • Using Voyage AI
    • Running a local embedding model
    • Utilizing OpenAI embedding APIs
    • Exploring the possibility of a Google embedding API
  • Concerns:
    • The performance impact of a local model compared to embedding APIs.
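The performance comparison between a local model and an embedding API ultimately comes down to how similar the resulting vectors are to a query vector. A common building block for that comparison is cosine similarity; the following is a minimal pure-Python sketch (the toy vectors are hypothetical stand-ins for embeddings from any of the providers above):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings from any provider.
query = [0.1, 0.3, 0.5]
doc_similar = [0.1, 0.3, 0.5]
doc_other = [0.9, -0.2, 0.0]

print(cosine_similarity(query, doc_similar))  # 1.0 for identical vectors
print(cosine_similarity(query, doc_other))
```

In practice the same similarity function can be run over outputs from each candidate provider on a shared test set, which turns the "performance impact" question into a measurable comparison.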

2. Automated Code Development and Safety

  • Discussion Points:
    • The potential of automated code development and refactoring was examined with both enthusiasm and caution about safety.
    • Two approaches for handling large code bases were proposed:
      • Inserting all code into a prompt (deemed impractical for very large code bases).
      • Using AI coding assistants to add features, fix issues, or optimize code.
  • Perspective:
    • The discussion emphasized the need to understand how AI coding assistants work, which may also help team members secure future coding roles.

3. CodeSplitter and Analysis Tools

  • Tools Mentioned:
    • Existing tools such as Grep and Roo were discussed for code analysis.
  • Custom CodeSplitter Proposal:
    • A custom tool may include metadata (source file, path, class, and method) in each chunk.
  • Current Issues Noted:
    • The present CodeSplitter lacks metadata storage.
    • Concerns about repeated lines within code chunks were raised, signaling a need for further investigation.
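The repeated-lines concern can be checked mechanically before any deeper investigation. Below is a small, hypothetical helper (not part of the existing CodeSplitter) that flags non-blank lines appearing in more than one chunk:

```python
def overlapping_lines(chunks: list[str]) -> set[str]:
    """Return non-blank lines that appear in more than one chunk."""
    seen: dict[str, int] = {}
    for chunk in chunks:
        # Count each distinct line at most once per chunk.
        for line in set(chunk.splitlines()):
            if line.strip():
                seen[line] = seen.get(line, 0) + 1
    return {line for line, count in seen.items() if count > 1}

# Two chunks that share a line, as reported in the meeting.
chunks = [
    "def add(a, b):\n    return a + b",
    "    return a + b\ndef sub(a, b):\n    return a - b",
]
print(overlapping_lines(chunks))  # {'    return a + b'}
```

Running something like this over real CodeSplitter output would show whether the repetition is deliberate overlap or a bug.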

4. Enhancing the Code Base Indexing System

  • Objective:
    • To improve indexing of the code base by selecting and indexing specific code chunks with augmented metadata.
  • Approach:
    • Break into small groups to brainstorm methods for enhancing the indexing tool.
    • Focus on maximizing the retrieval of related code segments and associating metadata (e.g., source code snippet, line numbers, file paths).
  • Goal:
    • Enhance the coding assistant’s ability to meet specific human objectives by accurately locating code sections.

Mermaid Diagram: Enhanced Code Base Indexing System

graph LR
  A[Source Code] --> B[CodeSplitter]
  B --> C[Code Chunks]
  C --> D[Metadata Augmentation]
  D --> E[Embeddings]
  E --> F[Vector Database]
  F --> G[Source Code RAG System]
  G --> H[AI Coding Assistant]
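The pipeline in the diagram can be read end-to-end with a toy sketch. Everything below is illustrative: a token-frequency vector stands in for a real embedding model, and a plain list stands in for the vector database; the chunk contents, paths, and function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens, so identifiers match query words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def toy_embed(text: str) -> dict[str, int]:
    """Stand-in for a real embedding model: a token-frequency vector."""
    return dict(Counter(tokens(text)))

def cosine(a: dict, b: dict) -> float:
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Vector database": an in-memory list of (embedding, metadata) pairs.
index: list[tuple[dict, dict]] = []

def add_chunk(code: str, path: str, start_line: int) -> None:
    metadata = {"code": code, "path": path, "start_line": start_line}
    index.append((toy_embed(code), metadata))

def query(text: str, k: int = 1) -> list[dict]:
    q = toy_embed(text)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [meta for _, meta in ranked[:k]]

add_chunk("def parse_config(path): ...", "config.py", start_line=10)
add_chunk("def render_page(template): ...", "views.py", start_line=42)
print(query("parse the config file")[0]["path"])  # config.py ranks first
```

Each stage maps onto a diagram node: `add_chunk` covers chunking, metadata augmentation, embedding, and storage; `query` is the retrieval half of the RAG system.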

5. Indexing Large Code Base Challenges

  • Discussion Points:
    • Challenges discussed included what content to embed and what metadata to attach.
    • The team is encouraged to devise proposals detailing the components of the embedding and associated metadata.
  • Goal:
    • To quantify and improve the effectiveness of the embeddings in achieving their intended purpose.
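One concrete way to quantify embedding effectiveness, consistent with the evaluation research assigned above, is recall@k over a held-out set of queries with known relevant chunks. The chunk IDs and rankings below are hypothetical:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant chunk IDs found in the top-k retrieved IDs."""
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant) if relevant else 0.0

# Hypothetical retrieval run: IDs ranked by similarity, plus ground truth.
retrieved = ["chunk_7", "chunk_2", "chunk_9", "chunk_4"]
relevant = {"chunk_2", "chunk_4"}
print(recall_at_k(retrieved, relevant, k=2))  # 0.5: one of two relevant chunks in top 2
print(recall_at_k(retrieved, relevant, k=4))  # 1.0: both found by top 4
```

Comparing this number across embedding providers or chunking strategies gives the team a shared yardstick rather than anecdotal impressions.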

6. Evaluation Methods and Group Participation

  • Structure:
    • Participants were divided into three groups to brainstorm evaluation methods.
    • Each group had 30 minutes to discuss before electing a representative to share findings.
  • Additional Considerations:
    • Experimentation with various types of source code.
    • A detailed understanding of code statements and function parameters was emphasized.
    • Participants were actively encouraged to ask questions and provide feedback.

7. Exploring CodeSplitter and Tree-Sitter

  • Topics Covered:
    • The meeting detailed the need to:
      • Better understand and possibly modify the CodeSplitter.
      • Explore the capabilities of tree-sitter in parsing and representing code as a tree structure.
  • Considerations:
    • Discussion on how to split text into manageable chunks.
    • Evaluating the placement of comments and summarizing code within functions.
  • Outcome:
    • Acknowledgment that further questions remain, prompting continued exploration.
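Tree-sitter represents source code as a syntax tree that can be traversed node by node. Since its Python bindings require a compiled grammar, the same idea can be previewed with the standard-library `ast` module as a Python-only stand-in (the `Greeter` class below is an invented example):

```python
import ast

source = """
class Greeter:
    def greet(self, name):
        # Say hello.
        return f"Hello, {name}"
"""

tree = ast.parse(source)

# Walk the tree and print node kinds with their line spans --
# roughly what a tree-sitter cursor traversal would expose.
for node in ast.walk(tree):
    if isinstance(node, (ast.ClassDef, ast.FunctionDef)):
        print(node.__class__.__name__, node.name, node.lineno, node.end_lineno)
```

Note one difference relevant to the comment-placement question above: `ast` discards comments, while tree-sitter keeps them as concrete nodes, which is one reason it was raised as the tool to investigate.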

8. Vector Metadata and Chunking Strategy

  • Process Overview:
    • Metadata is inserted into vector objects after the vectorization of code.
  • Proposed Method:
    • A compiler-like process:
      1. Code is transformed into a low-level representation.
      2. It is then processed into an abstract syntax description.
      3. Finally, a control flow graph is built.
    • This graph aids in attaching metadata (e.g., file information, line numbers, location in code).
  • Chunking Approach:
    • Use a hierarchical strategy where each chunk represents:
      • A complete class or function, or even a single line when necessary.
    • Metadata includes:
      • Type of chunk
      • File path
      • Line number
      • Class name
      • Method name
      • A summary generated by the language model (LM)
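The metadata schema listed above can be sketched as a small dataclass plus a function-level chunker. This is an illustrative sketch, not the team's CodeSplitter: it uses the stdlib `ast` module as a stand-in for the real parser, and the LM-generated summary is left as a placeholder string.

```python
import ast
from dataclasses import dataclass
from typing import Optional

@dataclass
class CodeChunk:
    chunk_type: str           # "class", "function", or "line"
    file_path: str
    start_line: int
    end_line: int
    class_name: Optional[str]
    method_name: Optional[str]
    summary: str              # would come from the LM in the real pipeline

def chunk_file(source: str, file_path: str) -> list[CodeChunk]:
    """Emit one chunk per function, tagged with the metadata fields above."""
    chunks = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # Find the enclosing class, if any (simplified: first match).
            parent = next(
                (c.name for c in ast.walk(tree)
                 if isinstance(c, ast.ClassDef) and node in ast.walk(c)),
                None,
            )
            chunks.append(CodeChunk("function", file_path, node.lineno,
                                    node.end_lineno, parent, node.name,
                                    summary="<LM summary placeholder>"))
    return chunks

source = "class Stack:\n    def push(self, x):\n        self.items.append(x)\n"
for c in chunk_file(source, "stack.py"):
    print(c.method_name, c.class_name, c.start_line, c.end_line)
```

Extending this hierarchically, as proposed, would mean also emitting class-level and, where necessary, line-level chunks.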

Mermaid Diagram: Vector Metadata and Chunking Strategy

graph TD
  A[Raw Code] --> B[Low-level Representation]
  B --> C[Abstract Syntax Generation]
  C --> D[Control Flow Graph]
  D --> E[Metadata Attachment]
  E --> F[LM Summary Generation]

9. CodeSplitter Algorithm and RAG

  • Key Points:
    • Emphasized the need for better understanding and testing of the CodeSplitter algorithm on real code.
    • An article describing code splitting (as developed by a specific company and integrated into LlamaIndex) was mentioned as forthcoming.
  • Additional Considerations:
    • Discussion on understanding the overall workflow, including syntax tree structure and fallback strategies (e.g., tree surfing) to ensure robust parsing.
    • The focus is on evaluating and building agents for the source code RAG system.
  • Long-Term Goal:
    • To potentially build an in-house coding assistant using their source code RAG system.
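The fallback strategy for unparsable code flagged in the next steps can be sketched simply: attempt syntax-aware splitting, and fall back to fixed-size line windows when parsing fails. The sketch below again uses the stdlib `ast` parser as a stand-in for the real splitter, with invented example inputs:

```python
import ast

def split_chunks(source: str, window: int = 20) -> list[str]:
    """Syntax-aware splitting, falling back to line windows on parse errors."""
    lines = source.splitlines()
    try:
        tree = ast.parse(source)
        # One chunk per top-level statement (function, class, etc.).
        return ["\n".join(lines[node.lineno - 1:node.end_lineno])
                for node in tree.body]
    except SyntaxError:
        # Fallback: fixed-size windows of raw lines.
        return ["\n".join(lines[i:i + window])
                for i in range(0, len(lines), window)]

good = "def f():\n    return 1\n\ndef g():\n    return 2\n"
bad = "def broken(:\n" + "x = 1\n" * 30
print(len(split_chunks(good)))            # 2: one chunk per function
print(len(split_chunks(bad, window=10)))  # 4: 31 lines in windows of 10
```

This keeps the RAG index complete even for files the parser cannot handle, at the cost of less meaningful chunk boundaries for those files.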

Conclusion

The meeting spanned multiple facets of code analysis and indexing, highlighting both the technical challenges and the collaborative steps required to enhance the team’s tools. Emphasis was placed on the critical role of metadata augmentation, effective chunking strategies, and robust evaluation methods to ensure the success of the source code RAG system and the eventual development of a custom coding assistant.
