Meeting Summary: CS 486-686 Lecture/Lab Spring 2025
Date: March 19, 2025
Time: 04:54 PM Pacific Time (US & Canada)
Meeting ID: 893 0161 6954
Quick Recap
Ground Truth Source Code Chunks:
Greg demonstrated the creation of ground truth chunks from the xv6 source code using powerful language models. A prompt was shown to generate these chunks with models including Claude 3.7, which provided the most comprehensive output.
System Call Ground Truth:
The discussion emphasized the need for a unique version of each system call that maps to the correct kernel function. The intent is to produce one ground truth per question, potentially merging results from the best-performing models.
Code Chunk Retrieval & Summarization:
Potential issues regarding the retrieval and summarization process were identified. Questions were raised regarding the feasibility of using file paths (and relative paths) and the need to refine the methods for generating ground truth.
Embedding & Model Hosting:
Different embedding models and hosting options were discussed, including OpenAI, Anthropic’s API, and local solutions. Permission issues on local resources (like USF’s CADE machine) were also mentioned.
Upcoming Hackathon Event:
An introduction was made for the Don’s Hack event—a 3-day hackathon co-hosted with Women in Tech. The theme centers on Campus Tech Projects, with various departments contributing prompts.
Next Steps
- Professor’s Tasks:
- Compute and provide all ground truths for the retrieval task.
- Supply sample LLM-as-judge code for student use.
- Send out a summary detailing the project setup decisions.
- Conduct further LLM-as-judge exercises in the next class.
- Student Tasks:
- Select an embedding model for both the baseline and enhanced versions.
- Implement baseline retrieval using provided code and the chosen embedding model.
- Develop enhanced retrieval techniques that improve upon the baseline method.
- Compare the results of the enhanced retrieval against the baseline using the same LLM for question answering.
- Complete the retrieval projects by the upcoming Tuesday.
- Report group project results in next Tuesday’s class.
Detailed Discussion
Generating Ground Truth Source Code Chunks
Greg outlined the process of generating ground truth chunks from xv6’s source code using various powerful models. A prompt was demonstrated, and results were compared across models:
- Observation:
Models varied in verbosity and relevance; Claude 3.7 was noted as the most comprehensive.
- Objective:
Achieve consensus on the most effective approach to generate reliable ground truth chunks.
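As a rough sketch of how such a ground-truth prompt might be assembled (the function name, wording, and example file below are illustrative assumptions, not the prompt Greg showed in class):

```python
def build_chunk_prompt(question: str, source_path: str, source_text: str) -> str:
    """Assemble an illustrative prompt asking a powerful model to extract
    ground-truth code chunks relevant to a question.

    The wording here is a hypothetical reconstruction, not the actual
    prompt demonstrated in class.
    """
    return (
        "You are given a source file from the course's kernel codebase.\n"
        f"File: {source_path}\n"
        "Return only the minimal, contiguous chunks of code (with line "
        "numbers) needed to answer the question below. Do not paraphrase.\n\n"
        f"Question: {question}\n\n"
        f"Source:\n{source_text}"
    )


# Example: build a prompt for a hypothetical fork-related question.
prompt = build_chunk_prompt(
    "How does fork copy the parent's address space?",
    "kernel/proc.c",
    "int fork(void) { /* ... */ }",
)
```

Comparing outputs across models (as was done with Claude 3.7 and others) then amounts to sending the same prompt to each and inspecting the returned chunks.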
System Call Ground Truth Discussion
- Key Point:
Each system call needs a single, consolidated ground truth that maps it to the correct kernel function.
- Discussion Points:
- The possibility of merging results from various models.
- Determining whether a strict, literal mapping of the system call is more pertinent than looser interpretations.
- Evaluating the current sampling, with considerations for establishing an upper bound.
- Discussion on recall and precision in the context of the retrieval performance program.
Retrieval Metrics and Ground Truth Discussion
- Metrics Covered:
- Recall
- Precision
- F1 Score
- Approach:
Rather than using a chunk-by-chunk basis, comparisons were made using the final set of retrieved lines.
- Additional Considerations:
- Possibility of having a subset or superset of ground truth chunks without adversely affecting the metrics.
- The need for a concrete numerical metric for computing recall and precision.
- Leveraging powerful models to derive ground truth for the retrieval task.
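The line-level comparison described above can be sketched as follows (the function name and the use of line-number sets are assumptions for illustration):

```python
def line_metrics(retrieved: set[int], truth: set[int]) -> dict[str, float]:
    """Compute recall, precision, and F1 over sets of retrieved source
    lines, i.e. on the final set of retrieved lines rather than
    chunk-by-chunk.

    Note the asymmetry discussed in class: retrieving a superset of the
    ground truth keeps recall at 1.0 but lowers precision, while a
    subset does the reverse.
    """
    overlap = len(retrieved & truth)
    recall = overlap / len(truth) if truth else 0.0
    precision = overlap / len(retrieved) if retrieved else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision, "f1": f1}
```

For example, retrieving lines {1, 2, 3, 4} against ground truth {2, 3, 4, 5} yields recall, precision, and F1 of 0.75 each.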
Code Chunk Retrieval and Summarization Issues
- Issues Identified:
- Concerns were raised over some reported coverage percentages versus manual observations.
- There is uncertainty in measuring the overlap between retrieved chunks and the ground truth.
- Plan Forward:
- Make the ground truth more inclusive by erring toward including additional relevant code chunks.
- Experiment with different language and embedding models to enhance performance.
Improving Retrieval System with Relative Paths
- Topic Discussion:
- Feasibility of incorporating file paths and relative paths into the retrieval system.
- The importance of configuring a reasonable ground truth.
- Consideration of different methods for system enhancement, including:
- Using a weaker model followed by re-ranking.
- Applying source code for reasoning and manipulation.
- Goal:
Create a simple baseline and then incrementally improve upon it.
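A minimal sketch of such a baseline, assuming chunks are keyed by relative file path and ranked by cosine similarity (the function names and toy vectors are illustrative; in a real run the embeddings would come from OpenAI or a locally hosted model):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec: list[float],
             chunks: dict[str, list[float]],
             k: int = 3) -> list[str]:
    """Rank chunks by similarity to the query and return the top-k paths.

    `chunks` maps a relative file path to that chunk's embedding; per the
    discussion, the text that gets embedded could itself include the
    relative path to give the model location context.
    """
    ranked = sorted(chunks.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [path for path, _ in ranked[:k]]
```

An enhanced version could then re-rank this top-k list with a stronger model, matching the weaker-model-plus-re-ranking idea raised in the discussion.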
Embedding Options and Model Hosting
- Discussed Options:
- OpenAI embeddings
- Anthropic’s API
- Local hosting solutions (e.g., USF’s CADE machine with GPUs)
- Challenges:
- Permission issues with local resource utilization.
- Suggestion:
Students are encouraged to select an environment (local or cloud) best suited to their project, rather than forcing a uniform setup.
OpenAI for Experiments and Baseline Discussion
- Experimentation Proposal:
- A suggested budget of $10 for OpenAI experiments; a credit card is required, though no charge is incurred unless the budget is exceeded.
- Baseline Development:
- The baseline includes selecting an embedding model and setting up the corresponding ground truths.
- A sample LLM-as-judge code will be provided.
- Collaboration Note:
Aria, president of Women in Tech and ACM, was introduced to the team.
Don’s Hack Event: Campus Tech Projects
- Event Overview:
- A 3-day hackathon co-hosted with Women in Tech.
- The event’s focus is on Campus Tech, with various departmental prompts suggesting real-world problems faced by students and faculty.
- Incentives:
- Prizes (e.g., TVs for each team member and collaboration with Red Bull)
- Alumni panel sessions
- Free meals and beverages
- Call to Action:
All participants were encouraged to sign up, with the note that projects need not strictly adhere to provided prompts.
Visual Representations
Retrieval System Flow
The following Mermaid diagram illustrates the process flow of the retrieval system, from ground truth generation to evaluation:
```mermaid
flowchart TD
    A[Generate Ground Truth Source Code Chunks] --> B[Baseline Retrieval Implementation]
    B --> C[Develop Enhanced Retrieval Methods]
    C --> D[Compare Enhanced vs. Baseline Using LLM-as-Judge]
```
Don’s Hack Event Overview
This diagram provides an overview of the structure and key elements of Don’s Hack event:
```mermaid
flowchart LR
    A[Don's Hack Event]
    A --> B[Co-hosted with Women in Tech]
    B --> C[Campus Tech Projects]
    C --> D[Departmental Prompts]
    D --> E[Team Formation & Project Development]
    E --> F[Alumni Panel & Networking]
    E --> G[Incentives: Prizes, TVs, Red Bull Collaboration]
    F --> H[Free Meals & Beverages]
```
Summary
The meeting focused on multiple aspects of the project, such as:
Generating Ground Truth Chunks:
Demonstrated through model comparisons, with a focus on obtaining the most comprehensive and relevant output.
System Call and Retrieval Metrics:
Emphasis was placed on creating a unified ground truth for system calls and refining metrics like recall, precision, and F1 score based on the end results of retrieved lines.
System Improvements:
Discussions highlighted challenges in code chunk retrieval and possibilities for enhancement through relative path usage, improved embeddings, and better evaluation methods.
Experimentation with External APIs:
OpenAI experiments and baseline development were proposed with a minimal financial threshold.
Hackathon Event Promotion:
The upcoming Don’s Hack event presents an opportunity for students to apply their learning in a competitive, collaborative, and rewarding environment.