
Meeting Summary for CS 486-686 Lecture/Lab Spring 2025

Date: March 18, 2025, 08:06 AM Pacific Time (US and Canada)
Meeting ID: 893 0161 6954


Quick Recap

Greg led discussions on evaluating generative AI systems with a focus on:

  • Information Retrieval Metrics: Emphasizing challenges in assessing code retrieval and question-answering capabilities.
  • Code Retrieval Techniques: Introducing a new code retrieval system and proposing methods to compare baseline versus enhanced retrieval techniques.
  • System Performance: Exploring approaches to improve performance, such as using large language models (LLMs) for summarization and re-ranking, while addressing issues such as chunk limits and embedding-model selection.

Next Steps

  • Ground Truth: Greg will provide definitive ground truth for code retrieval evaluation during the next lab section.
  • LLM Demonstration: Greg will demonstrate using an LLM as a judge to evaluate answer quality in the next lab session.
  • Baseline Experiment: Students are to experiment with the provided code_rag.py baseline implementation.
  • Parameter Settings: Students should use fixed parameters (e.g., chunk_lines, chunk_overlap, max_chunks, top_k) in their baseline implementations.
  • Enhancement Techniques: Students are encouraged to explore methods such as improved code splitting, metadata inclusion, and re-ranking in their enhanced systems.
  • Evaluation Process: Greg will clarify how metrics (precision, recall, F1) and answer quality are evaluated.
  • Embedding Models: Students must select embedding models for their implementations, considering token limits and quality requirements.

Detailed Summary

Evaluating Generative AI Systems and Retrieval Challenges

The discussion focused on the difficulties of evaluating generative AI systems. Key points included:

  • Task Definition: It is challenging to define a standardized task as the building block for generative AI applications.
  • Model Flexibility: Flexibility in choosing embedding models is essential.
  • Metric Introduction: Greg introduced recall, precision, and F1 metrics—where:
    • Recall: Measures the proportion of relevant documents retrieved over the total number of relevant documents.
    • Precision: Measures the proportion of relevant documents retrieved over the total number of documents retrieved.
    • F1 Score: The harmonic mean of precision and recall, providing a single metric that balances both.

A simplified diagram of how precision and recall interact is shown below:

flowchart TD
    A[Retrieved Documents] -->|Intersect with| B[Relevant Documents]
    B -->|Calculates| C[Recall]
    A -->|Determines| D[Precision]
    C & D --> E[F1 Score: Harmonic Mean]
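
To make the definitions above concrete, here is a minimal Python sketch (illustrative, not the course's grading code) that computes the three metrics from a set of retrieved items and a set of relevant items:

def precision_recall_f1(retrieved: set, relevant: set) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from two sets of item identifiers."""
    true_positives = len(retrieved & relevant)  # items both retrieved and relevant
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 3 of 4 retrieved items are relevant; 3 of 5 relevant items were found.
p, r, f1 = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "c", "e", "f"})
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.75 recall=0.60 f1=0.67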

Code Chunks for Precision and Recall

Greg explained how code can be segmented into chunks to facilitate metric computation. Each chunk is identified by its file name and line numbers and converted into a set of lines. These sets are then used to compute intersections (common elements) and unions, which in turn allow recall and precision to be calculated.

A flowchart depicting the process for handling code chunks is presented below:

flowchart TD
    A[Source Code] --> B[Split into Code Chunks]
    B --> C[Extract File Names & Line Numbers]
    C --> D[Form Sets for Each Chunk]
    D --> E[Compute Intersections & Unions]
    E --> F[Calculate Recall & Precision]
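
A minimal sketch of this process (names are illustrative, not taken from code_rag.py): each chunk is flattened into a set of (file, line) pairs, and the retrieved and ground-truth sets are then intersected.

def chunk_to_lines(file_path: str, start_line: int, end_line: int) -> set[tuple[str, int]]:
    """Represent a code chunk as a set of (file, line-number) pairs."""
    return {(file_path, n) for n in range(start_line, end_line + 1)}

# Hypothetical example: a retrieved chunk vs. a ground-truth chunk from the same file.
retrieved = chunk_to_lines("code_rag.py", 10, 40)
ground_truth = chunk_to_lines("code_rag.py", 25, 60)

overlap = retrieved & ground_truth          # lines both retrieved and relevant
precision = len(overlap) / len(retrieved)   # 16 / 31
recall = len(overlap) / len(ground_truth)   # 16 / 36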

Precision and Recall Metrics in Code Retrieval

In the context of code retrieval:

  • Precision: Calculated as the number of relevant code chunks retrieved divided by the total retrieved.
  • Recall: Determined by the number of relevant code chunks retrieved over the total number of relevant chunks.
  • Manual Inspection: Retrieval quality can also be checked by manually inspecting queries and the chunks returned for them.
  • LLM Evaluation: The eventual answer quality is proposed to be evaluated using an LLM as an impartial judge.

Token Size and Model Effectiveness

The discussion compared token sizes between local models and those from OpenAI:

  • Token Estimation: To estimate how much text fits within an embedding model's token limit, character counts are divided by a factor of 3 or 4 as a rough characters-per-token heuristic (see the sketch after this list).
  • Challenges: Judging whether the retrieved information actually supports a good answer is inherently subjective. One suggestion was to include entire files rather than only chunks, while recognizing that context-window limitations could affect performance.
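
A rough sketch of the characters-per-token heuristic mentioned above; the 3-4 figure is an approximation, and actual counts depend on the tokenizer:

def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate: English prose and code average roughly 3-4 characters per token."""
    return len(text) // chars_per_token

def fits_in_embedding_model(text: str, token_limit: int = 8192) -> bool:
    """Check whether a chunk is likely to fit within a model's token limit (the limit shown is illustrative)."""
    return estimate_tokens(text) <= token_limit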

Code Retrieval System Development

A new code retrieval system was introduced. Key elements include:

  • Chunking Process: Splitting code into chunks based on file paths and line numbers.
  • Comparison Technique: The “Compare Chunks” feature compares one JSON file (containing chunks) to another using set operations to eliminate duplicates and determine intersections.
  • Embeddings: The system utilizes a Chroma database for managing code embeddings.

The process is illustrated in the diagram below:

flowchart LR
    A[Source Tree Path] --> B[Split Code into Chunks]
    B --> C[Extract File Paths & Line Numbers]
    C --> D[Compute Sets for Each Chunk]
    D --> E["Compare Chunks (JSON Files)"]
    E --> F[Score Based on Intersecting Lines]
    F --> G[Store Embeddings in Chroma DB]
    G --> H[Output JSON with Metadata]
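
A minimal sketch of the "Compare Chunks" step, assuming each JSON file holds a list of chunks with file, start_line, and end_line fields; the field names, file names, and scoring formula are assumptions, not the tool's actual schema:

import json

def load_line_set(path: str) -> set[tuple[str, int]]:
    """Load a chunks JSON file and flatten it into a deduplicated set of (file, line) pairs."""
    with open(path) as f:
        chunks = json.load(f)
    lines = set()
    for c in chunks:
        lines |= {(c["file"], n) for n in range(c["start_line"], c["end_line"] + 1)}
    return lines

retrieved = load_line_set("retrieved_chunks.json")      # hypothetical file names
relevant = load_line_set("ground_truth_chunks.json")
score = len(retrieved & relevant) / len(retrieved | relevant)  # one possible overlap score: shared lines / all lines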

Evaluating Retrieval and Answering Questions

The evaluation strategy involves the following:

  • Two Retrieval Methods:
    • Baseline Retrieval: Runs a straightforward retrieval over the course data with fixed parameters.
    • Enhanced Retrieval: Incorporates metadata to improve performance.
  • Metrics Computation: Both methods are evaluated using precision, recall, and F1 scores.
  • LLM as Judge: An LLM is proposed to evaluate answer quality by categorizing responses as “good,” “better,” or “best.” The overall score is determined by averaging scores from the test set.

The evaluation process is depicted below:

flowchart TD
    A[Input Query] --> B[Baseline Retrieval System]
    A --> C[Enhanced Retrieval System]
    B --> D[Calculate Precision, Recall, F1]
    C --> E[Calculate Precision, Recall, F1]
    D & E --> F[LLM Evaluates Answer Quality]
    F --> G[Aggregate Scores: Good, Better, Best]
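
One way to implement the LLM-as-judge step is sketched below using the OpenAI Python client; the model name, prompt wording, and numeric mapping for "good", "better", and "best" are assumptions rather than the course's chosen setup:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
RATING = {"good": 1, "better": 2, "best": 3}  # illustrative numeric mapping

def judge_answer(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Ask an LLM to rate an answer as good, better, or best, and return the numeric score."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Rate the answer to the question with exactly one word: good, better, or best."},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
    )
    label = response.choices[0].message.content.strip().lower()
    return RATING.get(label, 0)  # unrecognized labels score 0

# Overall score: average the per-question ratings across the test set.
# scores = [judge_answer(q, a) for q, a in test_set]
# overall = sum(scores) / len(scores)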

Embedding Models and LLMs for Retrieval

The team discussed:

  • Baseline vs. Enhanced Retrieval: The baseline uses fixed parameters (e.g., chunk splitting settings) while enhanced retrieval leverages metadata and possibly LLM-based summarization.
  • Fixed Parameters: The number of chunks retrieved and the LLM’s context length will be held fixed at the baseline settings (a configuration sketch follows this list).
  • Model Selection: Students are encouraged to experiment with different embedding models while keeping token limits and performance quality in mind.
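
A sketch of how the fixed baseline parameters named in the next steps (chunk_lines, chunk_overlap, max_chunks, top_k) might be pinned in one place; the specific values shown are placeholders, not the assigned settings:

from dataclasses import dataclass

@dataclass(frozen=True)
class BaselineConfig:
    """Baseline retrieval parameters held fixed so runs are comparable (values are placeholders)."""
    chunk_lines: int = 40      # lines of code per chunk
    chunk_overlap: int = 10    # lines shared between adjacent chunks
    max_chunks: int = 1000     # cap on the number of chunks indexed
    top_k: int = 5             # chunks returned per query

BASELINE = BaselineConfig()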

Addressing Chunk Limitation Challenges

Key points regarding chunk limitations include:

  • Impact on Response Quality: Adjusting the number of chunks can affect answer quality.
  • Potential Solutions: Options such as pre-scaling the course content or applying an LLM to re-rank retrieved chunks were considered (a re-ranking sketch follows this list).
  • Adaptive Approach: The chunk limit will be fixed at an arbitrary value for now and revisited if problems arise.
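
A minimal sketch of LLM-based re-ranking, where llm_relevance_score is a hypothetical helper that asks an LLM to rate how relevant a chunk is to the query (for example, on a 0-10 scale):

from typing import Callable

def rerank(query: str, chunks: list[str], top_k: int,
           llm_relevance_score: Callable[[str, str], float]) -> list[str]:
    """Re-order retrieved chunks by an LLM-assigned relevance score and keep the top_k."""
    scored = [(llm_relevance_score(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest relevance first
    return [chunk for _, chunk in scored[:top_k]]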

Code RAG Demo and Usage

A demonstration of the new “code rag” system was presented. Highlights included:

  • Baseline Functionality: Lets users specify a source tree path and generate an output JSON file containing chunk metadata.
  • Chroma DB Integration: Supports creating an index in a Chroma database for embeddings (see the sketch after this list).
  • Next Steps: In the upcoming lab section, the team will work on establishing a definitive ground truth and evaluate answer quality using an LLM.
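
A minimal sketch of the Chroma DB integration, using the chromadb Python client; the collection name, storage path, and metadata fields are illustrative rather than code_rag.py's actual choices:

import chromadb

client = chromadb.PersistentClient(path="./chroma_index")         # illustrative path
collection = client.get_or_create_collection(name="code_chunks")  # illustrative name

# Index one chunk: the document text is embedded, and file/line info is kept as metadata.
collection.add(
    ids=["code_rag.py:1-40"],
    documents=["<text of the chunk>"],
    metadatas=[{"file": "code_rag.py", "start_line": 1, "end_line": 40}],
)

# Query: embed the question and return the closest chunks with their metadata.
results = collection.query(query_texts=["How are chunks compared?"], n_results=5)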

Key Takeaways

  • Metric Balance: The F1 score effectively balances recall and precision, making it crucial for evaluating information retrieval.
  • System Flexibility: Flexibility in choosing embedding models and evaluation parameters is essential.
  • Definitive Evaluation: Establishing a clear ground truth for code retrieval is a top priority.
  • Enhanced Techniques: Utilizing metadata, re-ranking, and LLM-based summarization can improve system performance.
  • Practical Demos: The code rag system serves as a practical baseline for further development and experimentation.

This document clarifies the discussion points and ensures that the key ideas and next steps are presented in a structured, third-person format, with diagrams to visually support the concepts.