
Meeting Summary: CS 486-686 Lecture/Lab Spring 2025

Date: February 11, 2025, 08:08 AM Pacific Time (US and Canada)
Meeting ID: 893 0161 6954


Quick Recap

  • Project Presentations:
    Upcoming project presentations were discussed; each student will demo their themed chat assistant.

  • Large Language Models (LLMs):
    The challenges of using LLMs with vast knowledge bases were addressed.

  • Embeddings:
    The concept of generating numerical representations for words or sentences was explained, including various distance metrics.

  • Vector Databases:
    An introduction to building vector databases was provided, highlighting storage of embeddings for efficient search.

  • Chunking:
    The idea of breaking large amounts of information into smaller chunks was explored.

  • Python Program to Derive a Vocabulary:
    Discussion of a Python program designed to read a text file and generate a vocabulary, along with adding embeddings and querying capabilities.


Next Steps

  • Student Action:
    All students are to prepare 5-minute demos of their themed chat assistants for Thursday’s class.

  • Presentation Schedule:
    Greg will randomly assign time slots for the demos and post the schedule.

  • Website Update:
    Greg will update the class website with recordings and summaries from the previous week.

  • Next Project Assignment:
    Greg will post the next project assignment, which will be a RAG-based project using the LlamaIndex framework.

  • Research Paper:
    Greg will assign the next research paper for the class to read.

  • Project Suggestions:
    Students who are still looking for final project suggestions should see Greg during office hours.

  • Additional Links:

    • Jay will post a link about the AI fact-finder issue on CampusWire.
    • Greg will post a link regarding the cloud AI news he mentioned on CampusWire.

Detailed Topics

Project Presentations and Course Updates

  • Presentation Format:
    Students will have approximately 5 minutes to demonstrate their themed chat assistants. Presentations will be projected via Zoom.
  • Schedule & Website Resources:
    Greg will randomly assign time slots for demos. Recordings and summaries from previous classes will be uploaded along with an expanded resource section on the class website.
  • Upcoming Curriculum:
    Topics will include embeddings, vector embedding distance functions, and vector databases. The class will build a simple RAG system in Python before moving on to the LlamaIndex framework.
  • Additional Recommendations:
    The class is encouraged to watch and discuss an Andrej Karpathy video on LLM theory and to subscribe to an AI news mailing list. Reading Hacker News for tech updates was also suggested.

Addressing LLM Knowledge Base Challenges

  • Prompt Length Issues:
    The difficulty of fitting extensive information into a single prompt was highlighted as an obstacle for LLMs.
  • Proposed Solution:
    Pre-process data to identify likely relevant information for any given query, thereby refining the context for the LLM.
  • Graph Databases:
    Graph databases were mentioned as a potential method for encoding relationships within the knowledge base.
  • Industry News:
    A recent offer to purchase OpenAI for $20 billion was mentioned, along with its potential implications.

Embeddings and Semantic Sentence Relationships

  • Definition:
    Embeddings are numerical representations that map words or sentences into a semantic vector space.
  • Distance Metrics:
    Distances between embeddings can be computed using Euclidean distance, cosine similarity, or other metrics to determine semantic relationships.
  • Importance of Context:
    The context of language is critical for accurately interpreting semantic similarities.

Mermaid Diagram – Embeddings Process:

flowchart TD
    A[Input Text]
    B[Embedding Process]
    C[Vector Representation]
    D[Compute Distance]
    E[Determine Semantic Relationship]

    A --> B
    B --> C
    C --> D
    D --> E
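As a concrete illustration of the distance metrics mentioned above, here is a minimal pure-Python sketch; the toy 3-dimensional vectors stand in for real embedding output, which typically has hundreds or thousands of dimensions:

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two vectors: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_similarity(u, v):
    """Angle-based similarity in [-1, 1]: larger means more similar."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

# Two hand-made "embeddings" for demonstration only.
a = [1.0, 2.0, 3.0]
b = [2.0, 2.0, 4.0]
```

Note that Euclidean distance is sensitive to vector magnitude while cosine similarity considers only direction, which is one reason normalization matters when comparing embeddings.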

Exploring Vector Databases and Applications

  • Design and Function:
    Vector databases store embeddings and enable efficient searches for related information.
  • Search Techniques:
    While brute-force enumeration guarantees accuracy, approximate nearest neighbor techniques are used to improve query speed.
  • Improving Accuracy:
    Techniques such as normalization and metadata filtering can improve the accuracy of similarity results.
  • Domain-Specific Applications:
    Potential applications include managing medical records, auto insurance policies, and supporting coding assistance systems.

Exploring Embeddings and Bag of Words

  • Bag of Words Technique:
    This technique creates a vocabulary from a corpus and represents sentences as vectors based on word frequency.
  • Simple Example:
    A sentence is transformed into a vector based on the number of times each word from the vocabulary appears.
  • Advanced Options:
    An embeddings API is available to obtain richer semantic values, and further refinement of the approach is planned.
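The bag-of-words transformation described above can be sketched in a few lines of Python; the tiny two-sentence corpus is purely illustrative:

```python
def build_vocabulary(corpus):
    """Collect the sorted set of unique lowercase words across the corpus."""
    words = set()
    for sentence in corpus:
        words.update(sentence.lower().split())
    return sorted(words)

def to_vector(sentence, vocabulary):
    """Represent a sentence as word counts over the vocabulary."""
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in vocabulary]

corpus = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocabulary(corpus)
```

With this vocabulary, "the cat sat" maps to a vector with a 1 in the positions for "cat", "sat", and "the", and 0 elsewhere, so sentences sharing words land near each other in vector space even though word order and meaning are ignored.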

Embedding and Text Matching Discussion

  • Text Matching:
    Embeddings can be used to represent text as vectors, enabling similarity searches.
  • Vector Database Querying:
    A simple vector database can be created to compare embeddings and perform text matching—this approach may outperform more advanced methods like BERT in certain contexts.
  • Chunking:
    Breaking large pieces of information into smaller chunks was discussed for integration into the matching process.
  • Additional Use Cases:
    The potential for code-to-text and text-to-code matching was considered, though concerns about accuracy were noted.
  • Future Direction:
    A pivot to root code exploration was suggested for further development.

Embedding and Vector Database Building

  • Project Importance:
    Embeddings are a vital component of the ongoing project.
  • Data Science Club Revival:
    Greg mentioned the revival of the Data Science Club following a two-year hiatus.
  • Database Construction:
    A simple vector database is being built by adding text along with its corresponding embedding.
  • Search Functionality:
    The database employs a nearest neighbor search (based on Euclidean distance) to retrieve relevant examples.
  • Further Integration:
    The database will continue to be expanded and refined with additional examples.
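A minimal sketch of the database described above, assuming in-memory storage and brute-force Euclidean nearest-neighbor search; the class name, method names, and 2-D embeddings are illustrative, not the course code:

```python
import math

class SimpleVectorDB:
    """Toy vector database: stores (text, embedding) pairs and retrieves
    the stored entries nearest to a query embedding."""

    def __init__(self):
        self.entries = []  # list of (text, embedding) tuples

    def add(self, text, embedding):
        self.entries.append((text, embedding))

    def query(self, embedding, k=1):
        """Return the k stored texts closest to the query, by Euclidean distance."""
        def dist(entry):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(entry[1], embedding)))
        return [text for text, _ in sorted(self.entries, key=dist)[:k]]

db = SimpleVectorDB()
db.add("the fox jumped", [1.0, 0.0])
db.add("stock prices fell", [0.0, 1.0])
```

Brute-force search like this scans every entry per query, which is exact but scales linearly; production vector databases trade exactness for speed with approximate nearest-neighbor indexes.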

Vector Database Implementation and Chunking

  • Implementation Progress:
    The project is progressing toward a fully functional vector database, with emphasis on handling large texts.
  • Chunking Strategy:
    A basic strategy involves splitting text at each period. More sophisticated techniques, such as sliding windows or preserving document structure, are under consideration.
  • Semantic Splitter:
    A semantic splitter that adaptively selects breakpoints using embedding similarity was introduced.
  • Next Steps:
    The teams will integrate these components to build a small RAG (Retrieval Augmented Generation) system.

Mermaid Diagram – Vector Database Pipeline:

flowchart TD
    A[Raw Text File] --> B[Text Chunking]
    B --> C[Embedding Generation]
    C --> D[Vector Database Storage]
    D --> E[Query System]
    E --> F[Embedding Query]
    F --> G[Nearest Neighbor Retrieval]
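The naive split-at-period strategy and the sliding-window variant mentioned above can be sketched as follows; function names and parameters are illustrative:

```python
def chunk_by_period(text):
    """Naive strategy: split the text at each period."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def chunk_sliding_window(words, size, stride):
    """Overlapping word windows preserve context across chunk boundaries."""
    stop = max(len(words) - size + 1, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, stride)]

text = ("Embeddings map text to vectors. Vector databases store them. "
        "Queries retrieve neighbors.")
chunks = chunk_by_period(text)
windows = chunk_sliding_window("a b c d e".split(), size=3, stride=1)
```

Period splitting is simple but breaks on abbreviations and loses cross-sentence context; overlapping windows address the latter at the cost of storing redundant text, which is why adaptive semantic splitters are attractive.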

Improving Query System Embeddings

  • System Challenges:
    Issues were identified with the query system, such as the query “which animal jumped” producing irrelevant results.
  • Debugging Steps:
    It was suggested to print the text and the computed distances for each query to better understand the results.
  • Query Adjustments:
    Consideration was given to refining the query (including punctuation removal) and examining the vectors for improved insights.
  • Consensus:
    The team agreed that the current system requires further refinement and adjustment.
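The suggested debugging step, printing each stored text alongside its computed distance, might look like the sketch below; the entries and tiny 2-D embeddings are made up for illustration:

```python
import math

def debug_query(entries, query_embedding):
    """Print every stored text with its distance to the query, nearest
    first, to see why a query returns surprising matches."""
    def distance(embedding):
        return math.sqrt(sum((a - b) ** 2
                             for a, b in zip(embedding, query_embedding)))
    scored = sorted((distance(emb), text) for text, emb in entries)
    for d, text in scored:
        print(f"{d:8.4f}  {text}")
    return scored

entries = [("the fox jumped", [1.0, 0.0]),
           ("stock prices fell", [0.0, 1.0])]
results = debug_query(entries, [0.8, 0.2])
```

Seeing the full ranked list often reveals whether a bad result comes from a poor embedding, an unnormalized vector, or stray punctuation in the query.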

Python Program for Vocabulary Generation

  • Program Purpose:
    A Python program is being developed to read a text file and generate a corresponding vocabulary.
  • Functionalities:
    The program will support:
    • Chunking the input text,
    • Adding text embeddings,
    • Querying embeddings,
    • Optionally outputting the vocabulary to a file.
  • Next Project & Research Paper:
    The next project will be a RAG-based assignment using the LlamaIndex framework, and a new research paper will be assigned for the class.
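A sketch of such a vocabulary program, assuming simple regex tokenization and frequency-ordered output; the function names, tokenization rule, and file-handling details are assumptions, not the actual assignment code:

```python
import collections
import re

def derive_vocabulary(text):
    """Tokenize into lowercase words and return them, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w, _ in collections.Counter(words).most_common()]

def vocabulary_from_file(path, out_path=None):
    """Read a text file, derive its vocabulary, and optionally write it
    one word per line to out_path."""
    with open(path, encoding="utf-8") as f:
        vocab = derive_vocabulary(f.read())
    if out_path is not None:
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n".join(vocab))
    return vocab

vocab = derive_vocabulary("The cat and the dog. The cat slept.")
```

The same vocabulary can then feed the bag-of-words vectorizer, and the chunking and embedding steps slot in before storage in the vector database.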

This summary captures the key points and next steps discussed during the meeting, as well as detailed insights into the topics surrounding embeddings, vector databases, and LLM challenges.
