Meeting Summary: CS 486-686 Lecture/Lab Spring 2025
Date: March 06, 2025, 08:07 AM Pacific Time (US and Canada)
Meeting ID: 893 0161 6954
Overview
The session, led by Greg, focused on various technical and process-oriented topics including code chunking, vector database retrieval, evaluation methods, and the challenges of integrating parsing tools. The discussion addressed both conceptual frameworks and practical troubleshooting of the tools used in the project.
Quick Recap
Vector Database & RAG Process:
Greg led a session on the retrieval process (referred to as RAG) and the construction of a vector database for code retrieval. The discussion included strategies for enhancing inputs to the embedding function and establishing a quantitative baseline through code chunking.
Tree-sitter C Parser Issues:
Technical challenges were discussed regarding the use of the tree-sitter C parser and its integration with the tree-sitter library bindings and other tools. Concerns about code setup and library paths were raised, and alternative resolutions were explored.
Code Splitting & Evaluation:
The meeting reviewed two versions of a code splitter, compared baseline retrieval results with enhanced methods, and debated strategies for chunking more intelligently, including handling comments and preserving directory structure. Issues with running the code and API rate limits were also discussed.
Project Progress:
Updates were shared on a project involving code chunking and analysis, including the generation of JSON outputs containing chunk metadata and the troubleshooting of directory and version conflicts.
Next Steps
Develop Chunk Comparison:
Greg will further develop the chunk comparison mechanism to analyze the overlap between different sets of code chunks.
Standardize Critical Code Sections:
The team will standardize the format for representing critical code sections (including file name, start line, and end line).
Experiment with Roo:
The team will test using Roo to generate “ground truth” data for relevant code chunks to support question-answering tasks.
Enhance Comment Chunking:
The team will modify the code splitter to better handle chunking of comments in source code.
Implement Enhanced Retrieval:
The team will implement and test an improved retrieval process that builds on the baseline code splitter results.
Define Retrieval Constraints:
Team members will determine appropriate constraints for top-k retrieval and optimal chunk sizes for evaluation.
Establish Evaluation Standards:
The team will create a standardized set of evaluation questions and a corresponding ground truth.
Investigate Llama Index:
The team will explore how Llama Index handles node relationships in the context of code retrieval.
Explore API Options:
Options for embedding API usage will be explored, including using Voyage AI with a credit card to bypass rate-limit issues.
Push Updated Code:
Greg will push the updated code for the new chunker and comparison tools to the class repository.
Detailed Discussion Topics
RAG, Vector Database, and Code Splitting
Greg introduced the core process of using RAG to build a vector database tailored for code retrieval (a minimal sketch follows the diagram below). Key points included:
- Enhancing the input to the embedding function for improved retrieval.
- Using code chunking to establish a quantitative baseline.
- Evaluating current code splitter algorithms to group similar source code segments.
- Sharing ongoing work and code updates via the class repository.
Retrieval Process Diagram
flowchart TD
A[Source Code] --> B[Code Splitter]
B --> C[Extracted Code Chunks]
C --> D[Metadata Embedding]
D --> E[Vector Database]
E --> F[Retriever]
F --> G[Retrieval Outcome]
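To make this flow concrete, the sketch below builds an in-memory vector store from code chunks and retrieves the top-k most similar ones. It is a minimal illustration, not the class code: the embed function is a hypothetical stand-in for a real embedding API (such as Voyage AI), and the chunks are invented examples.
Retrieval Sketch (Python, illustrative)
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a real embedding API call
    (e.g., Voyage AI); returns a deterministic unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

# Each chunk carries the metadata discussed in the session.
chunks = [
    {"file": "src/main.c", "start_line": 1, "end_line": 20,
     "text": "int main(void) { /* ... */ }"},
    {"file": "src/util.c", "start_line": 5, "end_line": 42,
     "text": "static int parse_args(int argc, char **argv) { /* ... */ }"},
]

# Build the "vector database": one embedding per chunk.
vectors = np.stack([embed(c["text"]) for c in chunks])

def retrieve(query: str, top_k: int = 1):
    """Return the top-k chunks by cosine similarity to the query."""
    q = embed(query)
    scores = vectors @ q  # cosine similarity, since all vectors are unit norm
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]

for hit in retrieve("where are command-line arguments parsed?", top_k=1):
    print(hit["file"], hit["start_line"], hit["end_line"])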
Tree-sitter C Parser Challenges
The session also addressed technical issues related to the tree-sitter C parser (a minimal parsing sketch follows this list), including:
- Difficulties integrating the C grammar with the tree-sitter library bindings and other tools.
- Issues with code setup and library paths.
- Exploration of alternative approaches, though some uncertainty about the optimal configuration remains.
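For reference, the sketch below shows one way to drive the C grammar from Python. It assumes the tree_sitter and tree_sitter_c packages with the newer bindings API (Language(tree_sitter_c.language()) and Parser(language)); older releases used Language.build_library and parser.set_language instead, which is one likely source of the setup and library-path confusion.
Tree-sitter Parsing Sketch (Python, illustrative)
# Assumes: pip install tree-sitter tree-sitter-c (newer bindings API).
import tree_sitter_c
from tree_sitter import Language, Parser

C_LANGUAGE = Language(tree_sitter_c.language())
parser = Parser(C_LANGUAGE)

source = b"""
/* add two ints */
int add(int a, int b) { return a + b; }
"""

tree = parser.parse(source)

# Report top-level function definitions with 1-based line spans,
# the same metadata the code splitter attaches to chunks.
for node in tree.root_node.children:
    if node.type == "function_definition":
        start = node.start_point[0] + 1
        end = node.end_point[0] + 1
        print(node.type, f"lines {start}-{end}")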
Establishing a Code Splitting Baseline for Retrieval
A baseline using the code splitter was proposed to facilitate comparisons with enhanced retrieval processes:
- Emphasis on extracting and embedding metadata (such as file names, start lines, and end lines); one possible record format is sketched after the diagram below.
- Development of a ground truth to isolate the minimal and most relevant portions of the source code.
- The importance of using precision and recall metrics to evaluate retrieval performance.
Baseline Setup Diagram
graph LR
A[Input Code] --> B[Baseline Code Splitter]
B --> C[Extracted Metadata]
C --> D[Ground Truth Generation]
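The standardized format for critical code sections was still an open action item; the sketch below assumes one plausible shape, a simple record with the three fields named in the discussion, serialized to JSON so both the baseline and enhanced pipelines can share the same ground-truth file.
Ground Truth Record Sketch (Python, illustrative)
import json
from dataclasses import dataclass, asdict

@dataclass
class CriticalSection:
    """One ground-truth span, using the fields named in the discussion."""
    file: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive

# Invented example spans; real ground truth would come from Roo or by hand.
ground_truth = [
    CriticalSection("src/parser.c", 120, 164),
    CriticalSection("src/lexer.c", 10, 32),
]

print(json.dumps([asdict(s) for s in ground_truth], indent=2))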
Exploring Retriever Outcomes and Precision
Greg examined the potential outcomes from retrieval systems:
- A retriever may capture either exactly the ground truth or include extraneous information, affecting precision.
- Partial retrievals of the ground truth would impact recall.
- Suggestions were made to restrict the context for more accurate retrieval, with plans to test baseline methods by the end of the lecture (a worked precision/recall example follows this list).
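To make these outcomes concrete, one way to score a retrieval (assuming line-span records like those above) is to expand both the retrieved chunks and the ground truth into sets of (file, line) pairs; extraneous lines then lower precision, and missed ground-truth lines lower recall.
Precision/Recall Sketch (Python, illustrative)
def line_set(spans):
    """Expand (file, start_line, end_line) spans into a set of (file, line) pairs."""
    return {(f, n) for f, start, end in spans for n in range(start, end + 1)}

def precision_recall(retrieved, truth):
    r, t = line_set(retrieved), line_set(truth)
    overlap = r & t
    precision = len(overlap) / len(r) if r else 0.0  # how much retrieved text is relevant
    recall = len(overlap) / len(t) if t else 0.0     # how much of the ground truth was found
    return precision, recall

truth = [("src/parser.c", 120, 164)]
retrieved = [("src/parser.c", 100, 150)]  # partial overlap, plus extraneous lines

p, r = precision_recall(retrieved, truth)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.61 recall=0.69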
Code Splitter Enhancements and Discussion
Two versions of the code splitter were demonstrated:
- Original Version: Provided basic code chunking.
- Enhanced Version: Offered more granular chunking with added metadata (line numbers, file names, paths) and improved grouping.
Additional discussion points included:
- Refining comment chunking (see the sketch below).
- Preserving relative paths to the source directory root.
- Introducing a new measure to evaluate the overlap between retrieved chunks and the ground truth.
A brief break was taken before resuming a deeper discussion on these improvements.
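How comments should be chunked was left open. The sketch below is a naive, line-based illustration of one option, not the team's splitter: blocks consisting entirely of C comments are merged into the code block that follows, so a function's documentation travels with the function.
Comment Chunking Sketch (Python, illustrative)
def chunk_with_comments(source: str):
    """Split C source on blank lines, then merge comment-only blocks
    into the following block so comments stay with their code.
    A naive illustration; a real splitter would use the parse tree."""
    blocks, current = [], []
    for line in source.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)

    def is_comment_block(block):
        return all(l.strip().startswith(("//", "/*", "*", "*/")) for l in block)

    chunks, pending = [], []
    for block in blocks:
        if is_comment_block(block):
            pending.extend(block)  # hold comments for the next code block
        else:
            chunks.append("\n".join(pending + block))
            pending = []
    if pending:
        chunks.append("\n".join(pending))
    return chunks

src = "/* adds two ints */\n\nint add(int a, int b) {\n    return a + b;\n}\n"
for c in chunk_with_comments(src):
    print("--- chunk ---")
    print(c)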
Addressing Code Issues and Troubleshooting
The session also covered multiple code-related issues:
- Problems with executing code examples prompted a pull of the latest updates from the repository.
- Redundant setup code was removed.
- Issues with API keys and rate limits for embeddings were addressed, with a suggestion to use Voyage AI.
- Troubleshooting steps for tree-sitter installation and Python version compatibility (including recommendations to downgrade) were discussed.
- Directory-related issues encountered during execution were also noted.
Code Chunking Progress and Future Steps
Updates on the code chunking project included:
- Introduction of a modified chunker that generates JSON objects containing chunk metadata (one possible schema is sketched after this list).
- Consideration for automating source file detection via introspection or command-line arguments.
- Plans to compare two sets of code chunks and further refine the retrieval process.
- An acknowledged uncertainty about the modifications, indicating that further investigation is required.
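The JSON schema was not pinned down in the session; the sketch below assumes one chunk object per output line (JSON Lines) carrying the metadata discussed, takes the source file as a command-line argument per the automation idea above, and uses fixed-size chunking purely as a placeholder for the real splitter.
Chunk Output Sketch (Python, illustrative)
import argparse
import json
import os

def emit_chunks(path: str, root: str, chunk_lines: int = 40):
    """Emit chunks as JSON objects with the metadata discussed:
    file name (relative to the source root) and 1-based line span.
    Fixed-size chunking here is a placeholder for the real splitter."""
    with open(path) as f:
        lines = f.readlines()
    rel = os.path.relpath(path, root)  # preserve path relative to the source root
    for start in range(0, len(lines), chunk_lines):
        end = min(start + chunk_lines, len(lines))
        print(json.dumps({
            "file": rel,
            "start_line": start + 1,
            "end_line": end,
            "text": "".join(lines[start:end]),
        }))

if __name__ == "__main__":
    ap = argparse.ArgumentParser(description="Chunk a source file to JSON Lines")
    ap.add_argument("source", help="source file to chunk")
    ap.add_argument("--root", default=".", help="source directory root for relative paths")
    args = ap.parse_args()
    emit_chunks(args.source, args.root)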
Code Chunking Tool and Evaluation Challenges
Greg reviewed a newly created code chunking tool:
- The tool processes and segments code, generating helpful metadata.
- API rate limiting issues were encountered during the use of an AI service.
- A novel scoring mechanism for comparing code chunks, independent of chunk size and top-k parameters, was proposed (one plausible instantiation is sketched after this list).
- The team recognized the challenge of creating a consistent evaluation methodology across different chunk sizes and stressed the need for standard evaluation questions and ground truth data.
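The proposed scoring mechanism was not described in detail; one plausible instantiation consistent with being independent of chunk size and top-k is to union all retrieved lines first and then take the intersection-over-union against the ground-truth lines, so only line coverage matters, not how the lines were packed into chunks.
Overlap Score Sketch (Python, illustrative)
def iou_score(retrieved_spans, truth_spans):
    """Intersection-over-union of covered source lines.
    Because all retrieved spans are unioned into one line set first,
    the score does not depend on how lines were packed into chunks
    or on how many chunks (top-k) were returned; only coverage matters."""
    def lines(spans):
        return {(f, n) for f, s, e in spans for n in range(s, e + 1)}
    r, t = lines(retrieved_spans), lines(truth_spans)
    union = r | t
    return len(r & t) / len(union) if union else 1.0

truth = [("src/parser.c", 120, 164)]
coarse = [("src/parser.c", 100, 180)]                            # one big chunk
fine = [("src/parser.c", 100, 140), ("src/parser.c", 141, 180)]  # same lines, two chunks
assert iou_score(coarse, truth) == iou_score(fine, truth)  # chunking doesn't change the score
print(round(iou_score(coarse, truth), 2))  # 0.56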
Evaluation Workflow Diagram
flowchart TD
A[Code Splitting] --> B[Baseline Retrieval]
B --> C[Enhanced Retrieval]
C --> D[Comparison Mechanism]
D --> E{Evaluation Metrics}
E --> F[Precision]
E --> G[Recall]
Summary
The meeting encompassed a broad spectrum of topics:
- Vector Database Development & Code Splitting: Focused on enhancing code retrieval through improved chunking and metadata extraction.
- Integration Challenges: Addressed the technical hurdles in using the tree-sitter C parser and aligning it with other tools.
- Baseline & Enhanced Retrieval: Emphasized creating a baseline for comparison, integrating enhanced retrieval methods, and evaluating system performance using precision and recall.
- Technical Troubleshooting: Identified and proposed solutions for running code examples, managing API rate limits, and resolving compatibility issues.
- Future Directions: Outlined next steps that include refining the chunk comparison mechanism, standardizing evaluation methods, and implementing improved retrieval processes.
The next phases of the project will focus on refining these processes, standardizing evaluation metrics, and integrating the enhanced tools into the class repository for further testing and development.