llama.mtmd

D bindings and wrappers for libmtmd (multimodal support).

libmtmd encodes images and audio into token embeddings that a language model can attend to alongside ordinary text tokens.

Typical usage:

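// fullPrompt: null-terminated prompt containing one media marker ("<__media__>") per bitmap.
auto mtmd = MtmdContext.initFromFile("mmproj.gguf", model.ptr); // check if (mtmd) before use
auto bmp  = mtmd.loadBitmap("photo.jpg");
auto chunks = InputChunks.create();
auto txt  = mtmd_input_text(&fullPrompt[0], true, true); // text, add_special, parse_special
mtmd.tokenize(chunks, txt, [bmp.ptr]);
llama_pos nPast;
// Arguments after chunks: n_past, seq_id, n_batch, logits_last; nPast receives the new position.
mtmd.evalChunks(ctx.ptr, chunks, 0, 0, 512, true, nPast);
// ... then sample as usual with SamplerChain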
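
The prompt handed to mtmd_input_text needs one media marker per bitmap and must reach the C side as a null-terminated string. The following is a minimal sketch of one way to prepare it, assuming the default marker "<__media__>" (the value mtmd_default_marker() is documented to return). The helper name buildPrompt is hypothetical and not part of llama.mtmd.

```d
import std.stdio : writeln;
import std.string : toStringz;

/// Hypothetical helper (not part of llama.mtmd): prefixes the question
/// with one media marker per bitmap, each on its own line.
string buildPrompt(string question, size_t nBitmaps, string marker = "<__media__>")
{
    string prompt;
    foreach (_; 0 .. nBitmaps)
        prompt ~= marker ~ "\n";
    return prompt ~ question;
}

void main()
{
    auto prompt = buildPrompt("Describe the image.", 1);
    writeln(prompt);

    // toStringz yields a null-terminated copy suitable for a const(char)* parameter.
    const(char)* cPrompt = prompt.toStringz;
    assert(cPrompt[prompt.length] == '\0');
}
```

Using toStringz avoids relying on the D string happening to be followed by a zero byte, which is only guaranteed for string literals.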

Members

Aliases

ggml_backend_sched_eval_callback alias ggml_backend_sched_eval_callback = bool function(void* tensor, bool ask, void* user_data) nothrow: Eval callback: return false to cancel.
ggml_log_callback alias ggml_log_callback = void function(int level, const(char)* text, void* user_data) nothrow: Logging callback: level is ggml_log_level cast to int.

Enums

mtmd_input_chunk_type enum mtmd_input_chunk_type: Chunk type emitted by mtmd_tokenize.

Functions

mtmd_bitmap_free void mtmd_bitmap_free(mtmd_bitmap* bitmap): Undocumented in source; a direct binding to the C function of the same name.
mtmd_bitmap_get_data const(ubyte)* mtmd_bitmap_get_data(const(mtmd_bitmap)* bitmap): Undocumented in source; a direct binding to the C function of the same name.
mtmd_bitmap_get_id const(char)* mtmd_bitmap_get_id(const(mtmd_bitmap)* bitmap): Optional string ID used for KV-cache tracking.
mtmd_bitmap_get_n_bytes size_t mtmd_bitmap_get_n_bytes(const(mtmd_bitmap)* bitmap): Undocumented in source; a direct binding to the C function of the same name.
mtmd_bitmap_get_nx uint mtmd_bitmap_get_nx(const(mtmd_bitmap)* bitmap): Undocumented in source; a direct binding to the C function of the same name.
mtmd_bitmap_get_ny uint mtmd_bitmap_get_ny(const(mtmd_bitmap)* bitmap): Undocumented in source; a direct binding to the C function of the same name.
mtmd_bitmap_init mtmd_bitmap* mtmd_bitmap_init(uint nx, uint ny, const(ubyte)* data): Create an image bitmap from raw RGB pixels (RGBRGBRGB…; length must equal nx * ny * 3).
mtmd_bitmap_init_from_audio mtmd_bitmap* mtmd_bitmap_init_from_audio(size_t n_samples, const(float)* data): Create an audio bitmap from float PCM samples.
mtmd_bitmap_is_audio bool mtmd_bitmap_is_audio(const(mtmd_bitmap)* bitmap): Undocumented in source; a direct binding to the C function of the same name.
mtmd_bitmap_set_id void mtmd_bitmap_set_id(mtmd_bitmap* bitmap, const(char)* id): Set the bitmap ID.
mtmd_context_params_default mtmd_context_params mtmd_context_params_default(): Returns a default-initialised mtmd_context_params.
mtmd_decode_use_mrope bool mtmd_decode_use_mrope(mtmd_context* ctx): True if the model uses M-RoPE (Multimodal RoPE) for llama_decode.
mtmd_decode_use_non_causal bool mtmd_decode_use_non_causal(mtmd_context* ctx): True if the model requires a non-causal attention mask for llama_decode.
mtmd_default_marker const(char)* mtmd_default_marker(): Returns the default media marker string ("<__media__>").
mtmd_encode_chunk int mtmd_encode_chunk(mtmd_context* ctx, const(mtmd_input_chunk)* chunk): Encode a single image/audio chunk. Returns 0 on success.
mtmd_free void mtmd_free(mtmd_context* ctx): Frees a multimodal context.
mtmd_get_audio_sample_rate int mtmd_get_audio_sample_rate(mtmd_context* ctx): Audio sample rate in Hz (e.g. 16 000 for Whisper), or -1 if unsupported.
mtmd_get_output_embd float* mtmd_get_output_embd(mtmd_context* ctx): Pointer to the float embeddings from the last mtmd_encode_chunk call.
mtmd_helper_bitmap_init_from_buf mtmd_bitmap* mtmd_helper_bitmap_init_from_buf(mtmd_context* ctx, const(ubyte)* buf, size_t len): Load from an in-memory buffer (JPEG/PNG/BMP/GIF/WAV/MP3/FLAC). Thread-safe.
mtmd_helper_bitmap_init_from_file mtmd_bitmap* mtmd_helper_bitmap_init_from_file(mtmd_context* ctx, const(char)* fname): Load an image or audio file into a bitmap. Thread-safe. Returns null on failure.
mtmd_helper_decode_image_chunk int mtmd_helper_decode_image_chunk(mtmd_context* ctx, llama_context* lctx, const(mtmd_input_chunk)* chunk, float* encoded_embd, llama_pos n_past, llama_seq_id seq_id, int n_batch, llama_pos* new_n_past): Decode an already-encoded image chunk (embeddings pre-calculated).
mtmd_helper_eval_chunk_single int mtmd_helper_eval_chunk_single(mtmd_context* ctx, llama_context* lctx, const(mtmd_input_chunk)* chunk, llama_pos n_past, llama_seq_id seq_id, int n_batch, bool logits_last, llama_pos* new_n_past): Like mtmd_helper_eval_chunks but for a single chunk.
mtmd_helper_eval_chunks int mtmd_helper_eval_chunks(mtmd_context* ctx, llama_context* lctx, const(mtmd_input_chunks)* chunks, llama_pos n_past, llama_seq_id seq_id, int n_batch, bool logits_last, llama_pos* new_n_past): Eval all chunks: text via llama_decode, images via mtmd_encode_chunk + llama_decode. Returns 0 on success. NOT thread-safe.
mtmd_helper_get_n_pos llama_pos mtmd_helper_get_n_pos(const(mtmd_input_chunks)* chunks): Total position count across all chunks (may differ from n_tokens for M-RoPE).
mtmd_helper_get_n_tokens size_t mtmd_helper_get_n_tokens(const(mtmd_input_chunks)* chunks): Total token count across all chunks (for KV-cache sizing).
mtmd_helper_log_set void mtmd_helper_log_set(ggml_log_callback log_callback, void* user_data): Set logging callback (also calls mtmd_log_set internally).
mtmd_image_tokens_get_id const(char)* mtmd_image_tokens_get_id(const(mtmd_image_tokens)* image_tokens): Undocumented in source; a direct binding to the C function of the same name.
mtmd_image_tokens_get_n_pos llama_pos mtmd_image_tokens_get_n_pos(const(mtmd_image_tokens)* image_tokens): Undocumented in source; a direct binding to the C function of the same name.
mtmd_image_tokens_get_n_tokens size_t mtmd_image_tokens_get_n_tokens(const(mtmd_image_tokens)* image_tokens): Undocumented in source; a direct binding to the C function of the same name.
mtmd_image_tokens_get_nx size_t mtmd_image_tokens_get_nx(const(mtmd_image_tokens)* image_tokens): Undocumented in source; a direct binding to the C function of the same name.
mtmd_image_tokens_get_ny size_t mtmd_image_tokens_get_ny(const(mtmd_image_tokens)* image_tokens): Undocumented in source; a direct binding to the C function of the same name.
mtmd_init_from_file mtmd_context* mtmd_init_from_file(const(char)* mmproj_fname, const(llama_model)* text_model, mtmd_context_params ctx_params): Initialises a multimodal context from a projector GGUF file. Returns null on failure (bad path, incompatible model, etc.).
mtmd_input_chunk_copy mtmd_input_chunk* mtmd_input_chunk_copy(const(mtmd_input_chunk)* chunk): Undocumented in source; a direct binding to the C function of the same name.
mtmd_input_chunk_free void mtmd_input_chunk_free(mtmd_input_chunk* chunk): Undocumented in source; a direct binding to the C function of the same name.
mtmd_input_chunk_get_id const(char)* mtmd_input_chunk_get_id(const(mtmd_input_chunk)* chunk): Undocumented in source; a direct binding to the C function of the same name.
mtmd_input_chunk_get_n_pos llama_pos mtmd_input_chunk_get_n_pos(const(mtmd_input_chunk)* chunk): Undocumented in source; a direct binding to the C function of the same name.
mtmd_input_chunk_get_n_tokens size_t mtmd_input_chunk_get_n_tokens(const(mtmd_input_chunk)* chunk): Undocumented in source; a direct binding to the C function of the same name.
mtmd_input_chunk_get_tokens_image const(mtmd_image_tokens)* mtmd_input_chunk_get_tokens_image(const(mtmd_input_chunk)* chunk): Undocumented in source; a direct binding to the C function of the same name.
mtmd_input_chunk_get_tokens_text const(llama_token)* mtmd_input_chunk_get_tokens_text(const(mtmd_input_chunk)* chunk, size_t* n_tokens_output): Undocumented in source; a direct binding to the C function of the same name.
mtmd_input_chunk_get_type mtmd_input_chunk_type mtmd_input_chunk_get_type(const(mtmd_input_chunk)* chunk): Undocumented in source; a direct binding to the C function of the same name.
mtmd_input_chunks_free void mtmd_input_chunks_free(mtmd_input_chunks* chunks): Undocumented in source; a direct binding to the C function of the same name.
mtmd_input_chunks_get const(mtmd_input_chunk)* mtmd_input_chunks_get(const(mtmd_input_chunks)* chunks, size_t idx): Undocumented in source; a direct binding to the C function of the same name.
mtmd_input_chunks_init mtmd_input_chunks* mtmd_input_chunks_init(): Undocumented in source; a direct binding to the C function of the same name.
mtmd_input_chunks_size size_t mtmd_input_chunks_size(const(mtmd_input_chunks)* chunks): Undocumented in source; a direct binding to the C function of the same name.
mtmd_log_set void mtmd_log_set(ggml_log_callback log_callback, void* user_data): Set a logging callback.
mtmd_support_audio bool mtmd_support_audio(mtmd_context* ctx): True if the model supports audio input.
mtmd_support_vision bool mtmd_support_vision(mtmd_context* ctx): True if the model supports image input.
mtmd_tokenize int mtmd_tokenize(mtmd_context* ctx, mtmd_input_chunks* output, const(mtmd_input_text)* text, const(mtmd_bitmap*)* bitmaps, size_t n_bitmaps): Tokenise a text prompt that may contain media markers. Returns 0 on success, 1 on bitmap-count mismatch, 2 on preprocessing error.
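
mtmd_tokenize reports failure through its integer return value, so callers usually want to turn that code into something readable before bailing out. A minimal sketch of such a mapping, using only the three codes documented above; the helper name describeTokenizeResult is hypothetical, and the values in main stand in for real return codes.

```d
import std.stdio : writeln;

/// Hypothetical helper (not part of llama.mtmd): maps the documented
/// mtmd_tokenize return codes to a short description.
string describeTokenizeResult(int rc)
{
    switch (rc)
    {
        case 0:  return "success";
        case 1:  return "bitmap count does not match the number of media markers";
        case 2:  return "media preprocessing error";
        default: return "unknown error";
    }
}

void main()
{
    // In real code rc would come from mtmd_tokenize(...); these are stand-ins.
    foreach (rc; [0, 1, 2, 7])
        writeln(rc, ": ", describeTokenizeResult(rc));
}
```

Any code other than 0, 1, or 2 is treated as unknown here because the documentation above does not define further values.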

Structs

InputChunks struct InputChunks: A list of tokenized input chunks produced by MtmdContext.tokenize. Supports foreach iteration over const(mtmd_input_chunk)* elements.
MtmdBitmap struct MtmdBitmap: An image or audio bitmap loaded from a file or raw buffer. Construct via MtmdBitmap.fromRGB, MtmdBitmap.fromAudio, or MtmdContext.loadBitmap.
MtmdContext struct MtmdContext: A multimodal projector context loaded from a GGUF file. Encodes images and audio into embeddings for the paired language model. Check if (ctx) after construction.
mtmd_bitmap struct mtmd_bitmap: Undocumented in source.
mtmd_context struct mtmd_context: Undocumented in source.
mtmd_context_params struct mtmd_context_params: Parameters for mtmd_init_from_file. Always initialise via mtmd_context_params_default() and then override fields.
mtmd_image_tokens struct mtmd_image_tokens: Undocumented in source.
mtmd_input_chunk struct mtmd_input_chunk: Undocumented in source.
mtmd_input_chunks struct mtmd_input_chunks: Undocumented in source.
mtmd_input_text struct mtmd_input_text: Text input descriptor passed to mtmd_tokenize.
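
mtmd_bitmap_init expects tightly packed RGB pixels whose length equals nx * ny * 3. A minimal sketch of preparing and checking such a buffer in plain D before handing it to the binding; the helper names rgbByteLength and solidRgb are hypothetical, and the call to mtmd_bitmap_init is only shown as a comment so the example runs without the library.

```d
/// Number of bytes mtmd_bitmap_init is documented to expect for an nx-by-ny RGB image.
size_t rgbByteLength(uint nx, uint ny)
{
    // Widen before multiplying so large images do not overflow 32 bits.
    return cast(size_t) nx * ny * 3;
}

/// Hypothetical helper: builds a solid-colour buffer in RGBRGBRGB... order.
ubyte[] solidRgb(uint nx, uint ny, ubyte r, ubyte g, ubyte b)
{
    auto buf = new ubyte[rgbByteLength(nx, ny)];
    for (size_t i = 0; i < buf.length; i += 3)
    {
        buf[i]     = r;
        buf[i + 1] = g;
        buf[i + 2] = b;
    }
    return buf;
}

void main()
{
    auto pixels = solidRgb(4, 2, 255, 0, 0);
    assert(pixels.length == rgbByteLength(4, 2)); // 4 * 2 * 3 = 24 bytes
    // With the library linked, the buffer could then be passed on:
    // auto bmp = mtmd_bitmap_init(4, 2, pixels.ptr);
}
```

Checking the length up front matters because the C function receives only a pointer and cannot detect a short buffer.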
