GGML ONNX Runtime

Learn how the GGML ONNX Runtime lets you run pretrained ONNX models with ggml’s quantized formats without writing any C/C++, using basic vision encoders as examples.

Overview

GGML is an open-source machine learning library written in C. It powers several popular open-source projects, such as llama.cpp and whisper.cpp, which let you run state-of-the-art transformer models on consumer hardware. These projects use ggml to convert model weights into memory-optimized quantized formats and then load those parameters into computational graphs defined in C/C++.
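To make "memory-optimized quantized formats" concrete, here is a minimal pure-Python sketch of block-wise 8-bit quantization, where each block of weights shares one scale factor. It is an illustration in the spirit of ggml's block formats, not ggml's actual code or file layout; the block size of 32 and the function names `quantize` and `dequantize` are assumptions made for this example.

```python
BLOCK_SIZE = 32  # assumption for this sketch: 32 weights share one scale


def quantize_block(values):
    """Map one block of floats to (scale, list of ints in [-127, 127])."""
    amax = max(abs(v) for v in values)
    # Pick the scale so the largest magnitude maps to 127; avoid dividing by 0.
    scale = amax / 127.0 if amax > 0 else 1.0
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def quantize(weights):
    """Split a flat list of weights into blocks and quantize each block."""
    if len(weights) % BLOCK_SIZE != 0:
        raise ValueError("number of weights must be a multiple of BLOCK_SIZE")
    return [
        quantize_block(weights[i:i + BLOCK_SIZE])
        for i in range(0, len(weights), BLOCK_SIZE)
    ]


def dequantize(blocks):
    """Recover approximate float weights from (scale, quants) blocks."""
    restored = []
    for scale, quants in blocks:
        restored.extend(q * scale for q in quants)
    return restored


if __name__ == "__main__":
    weights = [(i - 32) / 10 for i in range(64)]
    blocks = quantize(weights)
    restored = dequantize(blocks)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{len(blocks)} blocks, worst absolute error {worst:.5f}")
```

The memory saving comes from storing one small integer per weight plus one scale per block instead of a full 32-bit float per weight. The rounding error per weight is bounded by half of that block's scale.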

ONNX is a file format for describing a computational graph along with its parameters. ONNX graphs can be created automatically from pretrained models defined in PyTorch, TensorFlow, and other frameworks, and then run via an ONNX runtime implementation.

The GGML ONNX Runtime provides an implementation for running ONNX graphs in ggml. This allows you to convert and run pretrained models without having to define anything in C/C++. The project is still at an early stage, but it can already run some basic vision encoders that do not have existing ggml implementations.
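As a rough picture of what "running an ONNX graph" involves, the sketch below walks a list of nodes in order and dispatches each operator name to a backend function. It is a toy interpreter written for this page, not the GGML ONNX Runtime's code: the `Node` class, the `OPS` table, and `run_graph` are hypothetical names, and only three elementwise operators are covered. A real runtime would map each operator to the corresponding ggml tensor operation instead of Python lists.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class Node:
    """One graph node: an operator name plus named inputs and outputs."""
    op_type: str
    inputs: List[str]
    outputs: List[str]


# Hypothetical backend table: operator name -> implementation.
OPS: Dict[str, Callable[..., Vector]] = {
    "Add": lambda a, b: [x + y for x, y in zip(a, b)],
    "Mul": lambda a, b: [x * y for x, y in zip(a, b)],
    "Relu": lambda a: [max(0.0, x) for x in a],
}


def run_graph(nodes, initializers, feeds, output_names):
    """Evaluate nodes in order; ONNX stores nodes topologically sorted."""
    values = {**initializers, **feeds}  # stored weights plus user inputs
    for node in nodes:
        if node.op_type not in OPS:
            raise NotImplementedError(f"unsupported operator: {node.op_type}")
        args = [values[name] for name in node.inputs]
        values[node.outputs[0]] = OPS[node.op_type](*args)
    return [values[name] for name in output_names]


if __name__ == "__main__":
    # y = Relu(x * w + b), expressed as three nodes.
    graph = [
        Node("Mul", ["x", "w"], ["xw"]),
        Node("Add", ["xw", "b"], ["z"]),
        Node("Relu", ["z"], ["y"]),
    ]
    weights = {"w": [2.0, 2.0], "b": [1.0, -5.0]}
    print(run_graph(graph, weights, {"x": [1.0, 1.0]}, ["y"]))
```

Because the graph and its weights are plain data, supporting a new model mostly means covering the operators it uses, which is why no model-specific C/C++ is needed.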

Links

https://github.com/abetlen/ggml-python/pull/23
This pull request adds an ONNX backend for ggml-python, enabling graph execution and weight quantization.

Tech stack