AI Tinkerers Toronto - AI in Production: Evals & Observability Workshop with Weights & Biases - Sunday [AI Tinkerers - Toronto]

Sunday, February 23rd, 2025 2PM to 6PM (EST)
Address Info
Available on RSVP acceptance

Event Ended

This event has already taken place.

Attendees 258+ registered
Our attendees include engineers and leaders from Google, Meta, Shopify, and Qualcomm, specializing in machine learning, Python, and data science, alongside award-winning researchers published at NeurIPS.

AI Tinkerers Toronto - AI in Production: Evals & Observability Workshop in partnership with Weights & Biases is now live! Join us to build robust LLM apps with Observability and Evaluation!

(Illustration) Workshop banner showing a speaker presenting to an audience. Text: AI in Production: Evals & Observability Workshop. Trace, Compile, Eval with Weights & Biases. February 23, 2025, 2PM - 6PM. Human Feedback. AI Tinkerers Toronto. W&B.


What is AI Tinkerers?

AI Tinkerers is a meetup designed exclusively for practitioners with technical, machine learning, and entrepreneurial backgrounds who are actively building with foundation models and are eager to connect with like-minded technologists.


Who is this for?

  • AI tinkerers and AI Engineers building or managing LLM-based systems in production
  • Teams looking to replace ad-hoc “vibes-based” approaches with robust, future-proof evaluation pipelines
  • Anyone interested in reproducible logging, real-time analytics, and frictionless iteration with LLMs
  • Prior eval experience not required; basic Python experience is recommended for this workshop.

AI in Production: Evals & Observability Workshop

Join us for a hands-on workshop where you’ll learn to build and evaluate LLM-powered applications with robust observability practices. Using Weights & Biases Weave, we’ll walk you through common pain points and proven solutions for keeping AI models performing in real-world production environments.

(Illustration) Diagram of a process for building robust LLM apps using observability and evaluation. Text: Building robust LLM apps with Observability & Evaluation. 1. Look at your data: Trace all LLM interactions; What works and what doesn't; Collect Feedback and Annotate; Compile datasets; Use Datasets. 2. Evaluate constantly: Confidently Iterate on Prompts, New Models, Product Features; Offline Evaluations; Online Evaluations; Guardrails. 3 methods of evaluation grading: Programmatic, Human in the loop, LLM Judge (honorable mention: Vibes Evals); Alignment. Get started with 3 lines of code: https://wandb.me/tryweave

Alex Volkov is a leading AI practitioner and evangelist at Weights & Biases, as well as the host of the popular ThursdAI webcast, which attracts thousands of live listeners each week. Known for his ability to stay ahead of industry trends, Alex combines hands-on experience with a deep understanding of the complex landscape of emerging AI tools, evaluation methodologies, and observability best practices. His work helping practitioners move beyond prototypes into reliable, production-scale AI systems makes him an invaluable voice in the field.


What You’ll Learn

  • Tracing LLM Interactions
    Understand how to easily log each step of an LLM workflow, pinpoint issues faster, and maintain a historical record of inputs and outputs for better collaboration and troubleshooting.
  • Collecting & Leveraging User Feedback
    See why user annotations and feedback loops are vital to refining model performance. Build interactive UIs that capture structured inputs from real-world usage.
  • Dataset Creation & Versioning
    Learn best practices for compiling evaluation datasets from logs and user feedback. Manage versions effortlessly so you can track improvements over time.
  • Evaluation Pipelines
    Dive into three primary evaluation methods:
    • Programmatic – String matching, regex checks, and structured output validation.
    • Human-in-the-Loop – Manual labeling when tasks require nuance and domain expertise.
    • LLM-as-Judge – Automate grading with a second, higher-quality (or specialized) model to evaluate output correctness.
  • Meta Evaluation and Improvement of LLM Judges
    Building LLM judges is only the beginning. Learn to evaluate the judge itself, align it with human judges, and apply more advanced techniques to build a robust, state-of-the-art evaluation suite for your LLM application.
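
To make these topics concrete, here is a minimal sketch in plain Python of three of the ideas above: tracing a call, scoring an output programmatically, and measuring how often an LLM judge agrees with human labels. It is an illustration only, not the workshop's code and not the Weave API. Every name in it (TRACE_LOG, traced, fake_llm, programmatic_score, judge_agreement) and the JSON output shape are assumptions made for this example.

```python
import functools
import json
import re
import time

# Hypothetical in-memory trace store. A real setup would send these
# records to a tracing backend instead of keeping them in a list.
TRACE_LOG = []


def traced(fn):
    """Record the inputs, output, and latency of every call to fn."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        output = fn(*args, **kwargs)
        TRACE_LOG.append({
            "op": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": output,
            "latency_s": time.perf_counter() - start,
        })
        return output
    return wrapper


@traced
def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption): returns canned JSON."""
    return '{"answer": "yes", "confidence": 0.9}'


def programmatic_score(output: str) -> dict:
    """Programmatic checks: valid JSON, required keys, regex on the answer."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        parsed = None
    if not isinstance(parsed, dict):
        return {"valid_json": False, "has_keys": False, "answer_ok": False}
    has_keys = {"answer", "confidence"} <= set(parsed)
    answer_ok = bool(re.fullmatch(r"yes|no", str(parsed.get("answer", ""))))
    return {"valid_json": True, "has_keys": has_keys, "answer_ok": answer_ok}


def judge_agreement(judge_labels, human_labels) -> float:
    """Fraction of examples where the LLM judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

Calling fake_llm("...") appends one record to TRACE_LOG, and passing its output to programmatic_score returns a dictionary of boolean checks. A low judge_agreement value suggests the judge prompt or judge model needs work before its scores can be trusted.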

(Screenshot) A "Compare Evaluations" view in the thursdai project jan-llm-evals-workshop-s..., comparing a "No Context" baseline evaluation (6cfb) with a "With Context" evaluation (1f6c). Summary metrics shown: right_according_to_llm_judge.match (values 0.926 and 0.370), Model Latency (avg) (values 6.658 and 5.866), and Total Tokens (avg). Scorecard: both evaluations use model llm_judge_api:v0; a "Dataset inconsistency detected" notice lists doomer_or_boomer_dataset:v2 and doomer_or_boomer_dataset:v3.


Agenda

  • 2:00 PM - 2:30 PM: Arrival and Food
    Grab refreshments and settle in.
  • 2:30 PM - 3:30 PM: Trace & Compile
    • Understanding Evaluation Metrics for LLMs
    • Interactive Session: Implement observability tools and dashboards
  • 3:30 PM - 4:00 PM: Break
  • 4:00 PM - 5:00 PM: Evaluations Hands-On
    • Hands-On Lab: Build and refine an LLM evaluation pipeline
    • Case Studies & Troubleshooting: Explore real-world scenarios to identify and solve common pitfalls

Featured Speaker

Alex Volkov, AI Evangelist – Weights & Biases
Alex Volkov is known for his forward-thinking expertise on production-scale AI systems. As the host of the popular ThursdAI podcast, he stays on top of emerging LLM tools, evaluation methods, and best practices. Alex will demonstrate step-by-step how to integrate Weights & Biases Weave into your AI workflows for maximum reliability and traceability.


What to Bring

  • Your laptop. Code alongside practical examples. You’ll leave with a working pipeline for data collection, evaluations, and iterative improvements.
  • An OpenAI API key

Sponsor

This workshop is sponsored by Weights & Biases Weave. With just a few lines of code, you can log and visualize LLM interactions in intuitive dashboards. Weave also helps you evaluate and compare multiple models, from GPT-4 to R1 to custom fine-tuned solutions, so you can confidently scale your AI applications.



Event Host

Hosted by the Human Feedback Foundation, a Linux Foundation AI & Data nonprofit advancing a human-centric future for AI.

(Logo) Human Feedback


More Information:


Get ready to dive into LLM observability and evaluation! Whether you’re moving beyond prototypes or maintaining large deployments, this session will help prevent regressions, track performance, and optimize for the future. See you in Toronto!
