
Meeting Summary: CS 486-686 Lecture/Lab, Spring 2025

Date: February 25, 2025, 08:05 AM (Pacific Time – US & Canada)
Meeting ID: 893 0161 6954


Quick Recap

  • Project Progress:
    Greg discussed the development of a baseline implementation for a RAG (Retrieval-Augmented Generation) system using LlamaIndex with ChromaDB as the vector store.

  • Challenges:
    The discussion highlighted several obstacles:
    • Difficulties with using well-known code bases (e.g., a Unix kernel), which are likely already present in the models' training data.
    • Issues encountered with the code splitter.
    • The need to thoroughly evaluate the retrieval process.
  • Team Collaboration:
    The introduction of GitHub Classroom for organizing group projects was discussed.

  • Roadmap:
    The immediate roadmap includes working on the baseline implementation and reading an assigned paper for class discussion.

Next Steps

  • Read the Project 2 Specification: All students are to review the specification posted on the course website.
  • Review the Assigned Paper: All students should read the RAG paper to be discussed in the upcoming Tuesday class.
  • Begin the Baseline RAG Implementation: Start working on implementing the RAG system using the code splitter.
  • Team Formation: Students may form groups of up to 3 members for Project 2 as preferred.
  • Troubleshoot Code Splitter Issues: Greg will continue to address the C code parsing issues with the code splitter.
  • Explore Local Embedding Models: Students should investigate local embedding models as alternatives to cloud APIs.
  • Research Evaluation Metrics: Experiment with different metrics for assessing RAG system performance.
  • Attend Guest Lecture: All students must attend tomorrow’s session for part 2 of the guest lecture by Andrej Karpathy.
  • Prepare Metrics Materials: Greg is tasked with preparing materials on RAG metrics for Thursday’s class.
  • Join GitHub Classroom: Students should join the GitHub Classroom for Project 2 and form teams if collaborating.
  • Share Progress on Campuswire: Students who make headway with the code splitter issues are encouraged to share their findings.

Detailed Discussion Topics

Project Progress and Upcoming Tasks

  • The project specifications and the upcoming discussion paper are available on the class website.
  • The team is working on a baseline implementation using a code splitter.
  • Students are encouraged to form groups (maximum of three), keeping in mind that larger teams face higher expectations.
  • Collaboration on troubleshooting the code splitter is encouraged due to persistent parsing issues with C code.

LLM Performance and Reasoning Challenges

  • Performance Measurement:
    Addressing the challenges in constructing an LLM application, the discussion emphasized the importance of:
    • Quantifying system performance.
    • Measuring retrieval effectiveness.
    • Establishing solid ground truth metrics.
  • Reasoning Capabilities:
    The conversation also covered the nuances of reasoning in LLMs, referencing work by OpenAI and DeepSeek. It was also noted that Anthropic claims its latest Claude model can balance fast responses with variable reasoning depth, though the accuracy of these claims remains in question.

Aider Leaderboard Performance and Tools

  • Benchmark Performance:
    The newly released Claude 3.7 Sonnet was noted to perform slightly better on the Aider leaderboard than the previous leader (DeepSeek R1 paired with Claude 3.5 Sonnet).

  • Tool Highlights:
    • Discussion on the cost implications of running benchmarks.
    • Anthropic’s upgrade from Claude 3.5 to 3.7 was mentioned.
    • Interest in Anthropic’s new model card and API was expressed.
    • Claude Code, Anthropic’s new terminal-based coding assistant, was introduced.
  • Knowledge Sources:
    The potential of using existing code bases as knowledge sources was discussed, with examples such as xv6 and LlamaIndex itself. The role of code splitter and parser tools was underscored as crucial for broader applicability; loading such a corpus is illustrated below.
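
As a small illustration of the ingestion step, the snippet below loads the C sources of a project such as xv6 into LlamaIndex documents. This is a sketch rather than the class code: the directory path is a placeholder for a local clone, and it assumes llama-index ≥ 0.10.

```python
from llama_index.core import SimpleDirectoryReader

# Load every C source and header file from a local checkout of the
# target code base ("xv6-riscv" is a placeholder clone directory).
documents = SimpleDirectoryReader(
    "xv6-riscv",
    recursive=True,               # descend into kernel/, user/, etc.
    required_exts=[".c", ".h"],   # skip Makefiles, docs, and binaries
).load_data()

print(f"Loaded {len(documents)} source files")
```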

Developing the Baseline RAG System

  • Implementation Strategy:
    The baseline was developed leveraging LlamaIndex and ChromaDB; a minimal end-to-end sketch follows the bullets below.

  • Enhancements Proposed:
    • Utilize a code splitter to build a simple RAG system.
    • Improve chunking techniques and metadata post-processing.
    • Verify the functionality of the chunker.
    • Enhance data with additional context such as file paths and code line information.
  • Embedding Options:
    Different methods for computing embeddings were discussed:
    • OpenAI’s embedding service.
    • Stevens’ free service.
    • Local embedding models.
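
Putting these pieces together, here is a minimal end-to-end sketch of the baseline, assuming llama-index ≥ 0.10 with the ChromaDB and tree-sitter extras installed. The `documents` variable comes from the loading sketch earlier; the collection name, chunking parameters, and query are illustrative, and the embedding model defaults to OpenAI’s service unless overridden.

```python
import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import CodeSplitter
from llama_index.vector_stores.chroma import ChromaVectorStore

# Split the C sources into syntax-aware chunks using tree-sitter.
splitter = CodeSplitter(
    language="c",
    chunk_lines=40,          # illustrative chunking parameters
    chunk_lines_overlap=15,
    max_chars=1500,
)
nodes = splitter.get_nodes_from_documents(documents)  # `documents` from the loading sketch

# SimpleDirectoryReader already records file_path in each node's metadata;
# further context (e.g., line ranges) can be attached the same way.

# Persist embeddings in a local ChromaDB collection so indexing runs once.
client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection("baseline_rag")  # placeholder name
vector_store = ChromaVectorStore(chroma_collection=collection)
storage = StorageContext.from_defaults(vector_store=vector_store)

# Build the index; the embedding model defaults to OpenAI unless overridden.
index = VectorStoreIndex(nodes, storage_context=storage)

# Ask the baseline system a question about the code base.
query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("How does the scheduler pick the next process to run?"))
```

Persisting the Chroma collection means the embedding pass, typically the slowest and most expensive step, is paid once rather than on every run.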

Below is a mermaid diagram outlining the RAG system development process:

```mermaid
flowchart TD
    A["Start: Baseline RAG System"]
    B["Integrate LlamaIndex & ChromaDB"]
    C["Implement Code Splitter"]
    D["Enhance Chunking Techniques"]
    E["Add Metadata (File Paths, Code Lines)"]
    F["Compute Embeddings (Multiple Options)"]
    G["Evaluate Retrieval Performance"]

    A --> B --> C --> D --> E --> F --> G
```

Measuring Retrieval Performance and Model Evaluation

  • Key Considerations:
    • The distinction between the system’s ability to retrieve relevant information and its effectiveness in using that information.
    • Employing human inspection alongside automated evaluations.
    • Using older models to generate synthetic data as a testing method.
  • Prompt Construction:
    An LLM can be used to construct detailed prompts and synthetic test questions, aiding the measurement of RAG and embedding performance, as sketched below.
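
One way to combine these ideas is to have an LLM synthesize a question for each chunk; the originating chunk then serves as retrieval ground truth. The sketch below assumes the `nodes` list from the baseline sketch is available, and the model and prompt wording are examples, not what was used in class.

```python
from llama_index.llms.openai import OpenAI

# Generate one question per code chunk; the chunk that produced the
# question is treated as the "relevant" result the retriever should find.
llm = OpenAI(model="gpt-4o-mini")  # placeholder model choice

eval_pairs = []
for node in nodes[:50]:  # a sample of chunks from the baseline index
    question = llm.complete(
        "Write one specific question a developer might ask that is "
        f"answered by the following code:\n\n{node.text}"
    ).text.strip()
    eval_pairs.append((question, node.node_id))
```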

Challenges of Advanced Open Weight Models

  • Code Base Dilemma:
    Well-known code bases (e.g., the Unix kernel) are likely already present in the models’ training data, which muddies evaluation, while less common code bases are harder to source.

  • Local Model Limitations:

    • Many companies opt for local models to avoid cloud APIs and protect intellectual property.
    • Running advanced open-weight models locally demands significant hardware investment. Quantized versions are more accessible but do not perform as well as the full models.
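
In the same spirit of avoiding cloud APIs, the embedding side of the baseline can run locally. A minimal sketch, assuming the llama-index-embeddings-huggingface package is installed and using BAAI/bge-small-en-v1.5 purely as an example of a small open model:

```python
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Swap the default cloud embedding service for a small open model that
# runs on a laptop CPU; any sentence-transformers model can be substituted.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Any index built after this point embeds locally instead of calling OpenAI.
```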

ChatGPT and API Security

  • Data Security Concerns:
    • There is ongoing concern about the security of coding and data when using external APIs.
    • Greg argued that language models should be treated like any other cloud service, given that businesses routinely trust these services with their data.
  • Execution Over Ideas:
    The discussion emphasized that execution and dedication to improvement are paramount, illustrated by a reference to the Winklevoss twins’ lawsuit against Mark Zuckerberg: success relies more on effective execution than on the strength of the initial idea.

Evaluation of the Retrieval Step

  • Importance of Evaluation:
    Evaluating the retrieval step is crucial for benchmarking the extraction of relevant code snippets and related information.

  • Proposed Methods:

    • Automated evaluation with possible human oversight; a scoring sketch follows this list.
    • Using a reasoning model to generate ground truth for comparisons.
    • Continued work on enhancing the baseline implementation through improved chunking and metadata inclusion.
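
Given question–chunk pairs such as those generated earlier, the retrieval step can be scored automatically. Below is a sketch of hit rate at k, assuming the `index` and `eval_pairs` objects from the previous sketches; other metrics (e.g., reciprocal rank) follow the same pattern.

```python
# A "hit" means the chunk that generated the question shows up in the
# retriever's top-k results for that question.
retriever = index.as_retriever(similarity_top_k=5)

hits = 0
for question, expected_id in eval_pairs:
    retrieved_ids = [r.node.node_id for r in retriever.retrieve(question)]
    hits += expected_id in retrieved_ids

print(f"Hit rate@5: {hits / len(eval_pairs):.2f}")
```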

Technical Discussions and API Usage

  • Topics Covered:
    • Code splitting techniques.
    • Deployment and usage of models from Hugging Face on CPUs.
    • Laptop memory requirements for running large models.
    • Code changes and troubleshooting for the code splitter.
  • Additional Remarks:
    Brief mentions were made about office temperature issues and the potential reactivation of a cooling system that had been non-operational for seven years.

Code Splitter Troubleshooting and GitHub Classroom

  • Troubleshooting Efforts:
    • Upgrading Tree-sitter and adjusting language-specific settings.
    • Considering the setup of a fresh virtual environment; a quick splitter sanity check is sketched at the end of this section.
  • Team Collaboration:
    • The use of GitHub Classroom was introduced to facilitate group projects.
    • Emphasis was placed on forming teams, inviting collaborators, and choosing effective team names.
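
For the splitter issues above, a quick sanity check after rebuilding the environment is to run the CodeSplitter on a tiny C program in isolation. This is a sketch, assuming the tree-sitter C grammar is installed (recent llama-index builds pull it in via a language pack).

```python
from llama_index.core.node_parser import CodeSplitter

# Minimal C program: if this fails to split, the problem is in the
# tree-sitter installation rather than in the project's own code.
c_source = """
#include <stdio.h>

int main(void) {
    printf("hello\\n");
    return 0;
}
"""

splitter = CodeSplitter(language="c")
chunks = splitter.split_text(c_source)
print(f"Parsed into {len(chunks)} chunk(s)")
```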

Project Progress and Future Plans

  • Current Challenges:
    • Addressing installation issues with a specific tool.
    • Considering the creation of a custom node parser for LlamaIndex.
    • Managing simultaneous work on a shared repository, which may introduce merge conflicts.
  • Planned Supports:
    • Utilizing ChatGPT for assistance.
    • Continuing the baseline implementation and discussing metrics for RAG performance on Thursday.
  • Timeline:
    The project is due after spring break, and a demo day is scheduled to showcase progress.

Below is a mermaid flowchart representing the overall project workflow and next steps:

```mermaid
flowchart LR
    A["Review Project 2 Spec & RAG Paper"]
    B["Begin Baseline RAG Implementation"]
    C["Form Groups (Up to 3 Members)"]
    D["Join GitHub Classroom"]
    E["Troubleshoot Code Splitter Issues"]
    F["Enhance RAG System (Chunking, Metadata)"]
    G["Evaluate Retrieval Performance"]
    H["Prepare for Demo Day Post Spring Break"]

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
```

Conclusion

The meeting covered extensive topics including project progress, challenges in LLM performance and reasoning, leader board performance, technical hurdles, and security issues with APIs. The discussion underlined the importance of collaborative problem-solving, rigorous evaluation metrics, and systematic improvements to the RAG system. Clear next steps were outlined for advancing the project and preparing for upcoming class sessions and deliverables.
