llama.mtmd

D bindings and wrappers for libmtmd (multimodal support).

libmtmd encodes images and audio into token embeddings that a language model can attend to alongside ordinary text tokens.

Typical usage:

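// Load the projector GGUF and pair it with an already-loaded text model.
auto mtmd = MtmdContext.initFromFile("mmproj.gguf", model.ptr);
auto bmp  = mtmd.loadBitmap("photo.jpg");
auto chunks = InputChunks.create();
// fullPrompt must be NUL-terminated and contain one media marker per bitmap.
auto txt  = mtmd_input_text(&fullPrompt[0], true, true);
mtmd.tokenize(chunks, txt, [bmp.ptr]);
llama_pos nPast;
// Start at position 0 on sequence 0 with a batch size of 512; nPast receives the new position.
mtmd.evalChunks(ctx.ptr, chunks, 0, 0, 512, true, nPast);
// ... then sample as usual with SamplerChain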

Members

Aliases

ggml_backend_sched_eval_callback
alias ggml_backend_sched_eval_callback = bool function(void* tensor, bool ask, void* user_data) nothrow

Eval callback invoked by the ggml backend scheduler. Return false to cancel the computation.

ggml_log_callback
alias ggml_log_callback = void function(int level, const(char)* text, void* user_data) nothrow

Logging callback: level is ggml_log_level cast to int.
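A minimal sketch of a callback with this shape, which filters by level and copies the message into a caller-owned sink. LogCallback, LogSink and collectLog are hypothetical names used only for illustration, and the extern(C) linkage is an assumption about the binding, since the alias shown above does not state it. The level threshold is an arbitrary number, not a ggml_log_level constant.

```d
import core.stdc.string : strlen;

// Local stand-in with the same shape as ggml_log_callback.
// Assumption: the real binding uses C linkage, so extern(C) is added here.
alias LogCallback = extern(C) void function(int level, const(char)* text, void* user_data) nothrow;

// Hypothetical sink that the callback reaches through user_data.
struct LogSink
{
    int minLevel;    // messages below this level are dropped
    size_t count;    // number of messages accepted so far
    char[256] last;  // copy of the most recent accepted message (truncated)
    size_t lastLen;  // number of valid bytes in `last`
}

extern(C) void collectLog(int level, const(char)* text, void* user_data) nothrow
{
    auto sink = cast(LogSink*) user_data;
    if (sink is null || text is null || level < sink.minLevel)
        return;

    // The C side owns `text`, so copy it instead of keeping the pointer.
    size_t n = strlen(text);
    if (n > sink.last.length)
        n = sink.last.length;
    sink.last[0 .. n] = text[0 .. n];
    sink.lastLen = n;
    sink.count++;
}
```

If the linkage assumption holds, a function like collectLog could be passed to mtmd_log_set or mtmd_helper_log_set together with a pointer to a LogSink that outlives the context.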

Enums

mtmd_input_chunk_type
enum mtmd_input_chunk_type

Chunk type emitted by mtmd_tokenize.

Functions

mtmd_bitmap_free
void mtmd_bitmap_free(mtmd_bitmap* bitmap)
Frees a bitmap created by mtmd_bitmap_init, mtmd_bitmap_init_from_audio, or one of the mtmd_helper_bitmap_init_from_* loaders.
mtmd_bitmap_get_data
const(ubyte)* mtmd_bitmap_get_data(const(mtmd_bitmap)* bitmap)
Pointer to the bitmap's raw data: RGB bytes for an image, float PCM samples for audio. The length is given by mtmd_bitmap_get_n_bytes.
mtmd_bitmap_get_id
const(char)* mtmd_bitmap_get_id(const(mtmd_bitmap)* bitmap)

Optional string ID used for KV-cache tracking.

mtmd_bitmap_get_n_bytes
size_t mtmd_bitmap_get_n_bytes(const(mtmd_bitmap)* bitmap)
Size in bytes of the buffer returned by mtmd_bitmap_get_data.
mtmd_bitmap_get_nx
uint mtmd_bitmap_get_nx(const(mtmd_bitmap)* bitmap)
Image width in pixels.
mtmd_bitmap_get_ny
uint mtmd_bitmap_get_ny(const(mtmd_bitmap)* bitmap)
Image height in pixels.
mtmd_bitmap_init
mtmd_bitmap* mtmd_bitmap_init(uint nx, uint ny, const(ubyte)* data)

Create an image bitmap from raw RGB pixels (RGBRGBRGB…; length must equal nx * ny * 3).
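The buffer layout can be sketched as follows: a hypothetical helper, solidRgb, fills an interleaved RGB buffer whose length is nx * ny * 3, the length the function above requires. The helper name and the solid-colour fill are illustrative only, and the binding call itself appears only in a comment because it needs the native library.

```d
// Hypothetical helper: builds an interleaved RGB buffer (R, G, B per pixel)
// filled with one colour. The result has exactly nx * ny * 3 bytes.
ubyte[] solidRgb(uint nx, uint ny, ubyte r, ubyte g, ubyte b)
{
    auto buf = new ubyte[](cast(size_t) nx * ny * 3);
    for (size_t i = 0; i < buf.length; i += 3)
    {
        buf[i]     = r;
        buf[i + 1] = g;
        buf[i + 2] = b;
    }
    return buf;
}

// With the binding linked in, the buffer would be handed over like this:
//     auto px  = solidRgb(64, 64, 255, 128, 0);
//     auto bmp = mtmd_bitmap_init(64, 64, px.ptr);
//     // ... use bmp, then release it with mtmd_bitmap_free(bmp)
```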

mtmd_bitmap_init_from_audio
mtmd_bitmap* mtmd_bitmap_init_from_audio(size_t n_samples, const(float)* data)

Create an audio bitmap from float PCM samples.
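As an illustration of the expected input, the sketch below generates float PCM samples for a sine tone. sineWave is a hypothetical helper. The 16000 Hz rate in the comment is only an example, and the real rate should come from mtmd_get_audio_sample_rate.

```d
import std.math : sin, PI;

// Hypothetical helper: mono sine tone as float PCM samples in [-1, 1].
float[] sineWave(int sampleRate, float freqHz, float seconds)
{
    auto pcm = new float[](cast(size_t)(sampleRate * seconds));
    foreach (i, ref s; pcm)
        s = cast(float) sin(2.0 * PI * freqHz * i / sampleRate);
    return pcm;
}

// With the binding linked in, the samples would be handed over like this:
//     auto pcm = sineWave(16_000, 440.0f, 0.5f);
//     auto bmp = mtmd_bitmap_init_from_audio(pcm.length, pcm.ptr);
```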

mtmd_bitmap_is_audio
bool mtmd_bitmap_is_audio(const(mtmd_bitmap)* bitmap)
True if the bitmap holds audio samples rather than image pixels.
mtmd_bitmap_set_id
void mtmd_bitmap_set_id(mtmd_bitmap* bitmap, const(char)* id)

Set the bitmap ID.

mtmd_context_params_default
mtmd_context_params mtmd_context_params_default()

Returns a default-initialised mtmd_context_params.

mtmd_decode_use_mrope
bool mtmd_decode_use_mrope(mtmd_context* ctx)

True if the model uses M-RoPE (Multimodal RoPE) for llama_decode.

mtmd_decode_use_non_causal
bool mtmd_decode_use_non_causal(mtmd_context* ctx)

True if the model requires a non-causal attention mask for llama_decode.

mtmd_default_marker
const(char)* mtmd_default_marker()

Returns the default media marker string ("<__media__>").
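Because each marker in a prompt stands for one bitmap, it can help to count markers before tokenizing. The sketch below does that in plain D. countMarkers and markersMatch are hypothetical helpers, and the marker text is hard-coded from the value documented above rather than read from the library.

```d
import std.algorithm.searching : count;

// Marker text as documented for mtmd_default_marker().
enum defaultMarker = "<__media__>";

// Hypothetical helper: number of non-overlapping markers in a prompt.
size_t countMarkers(const(char)[] prompt, const(char)[] marker = defaultMarker)
{
    return prompt.count(marker);
}

// Hypothetical helper: true if the prompt has exactly one marker per bitmap.
bool markersMatch(const(char)[] prompt, size_t nBitmaps)
{
    return countMarkers(prompt) == nBitmaps;
}
```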

mtmd_encode_chunk
int mtmd_encode_chunk(mtmd_context* ctx, const(mtmd_input_chunk)* chunk)

Encode a single image/audio chunk. Returns 0 on success.

mtmd_free
void mtmd_free(mtmd_context* ctx)

Frees a multimodal context.

mtmd_get_audio_sample_rate
int mtmd_get_audio_sample_rate(mtmd_context* ctx)

Audio sample rate in Hz (e.g. 16000 for Whisper), or -1 if unsupported.

mtmd_get_output_embd
float* mtmd_get_output_embd(mtmd_context* ctx)

Pointer to the float embeddings from the last mtmd_encode_chunk call.

mtmd_helper_bitmap_init_from_buf
mtmd_bitmap* mtmd_helper_bitmap_init_from_buf(mtmd_context* ctx, const(ubyte)* buf, size_t len)

Load an image or audio bitmap from an in-memory buffer (JPEG/PNG/BMP/GIF/WAV/MP3/FLAC). Thread-safe.

mtmd_helper_bitmap_init_from_file
mtmd_bitmap* mtmd_helper_bitmap_init_from_file(mtmd_context* ctx, const(char)* fname)

Load an image or audio file into a bitmap. Thread-safe. Returns null on failure.

mtmd_helper_decode_image_chunk
int mtmd_helper_decode_image_chunk(mtmd_context* ctx, llama_context* lctx, const(mtmd_input_chunk)* chunk, float* encoded_embd, llama_pos n_past, llama_seq_id seq_id, int n_batch, llama_pos* new_n_past)

Decode an already-encoded image chunk (embeddings pre-calculated).

mtmd_helper_eval_chunk_single
int mtmd_helper_eval_chunk_single(mtmd_context* ctx, llama_context* lctx, const(mtmd_input_chunk)* chunk, llama_pos n_past, llama_seq_id seq_id, int n_batch, bool logits_last, llama_pos* new_n_past)

Like mtmd_helper_eval_chunks but for a single chunk.

mtmd_helper_eval_chunks
int mtmd_helper_eval_chunks(mtmd_context* ctx, llama_context* lctx, const(mtmd_input_chunks)* chunks, llama_pos n_past, llama_seq_id seq_id, int n_batch, bool logits_last, llama_pos* new_n_past)

Eval all chunks: text via llama_decode, images via mtmd_encode_chunk + llama_decode. Returns 0 on success. NOT thread-safe.

mtmd_helper_get_n_pos
llama_pos mtmd_helper_get_n_pos(const(mtmd_input_chunks)* chunks)

Total position count across all chunks (may differ from n_tokens for M-RoPE).

mtmd_helper_get_n_tokens
size_t mtmd_helper_get_n_tokens(const(mtmd_input_chunks)* chunks)

Total token count across all chunks (for KV-cache sizing).

mtmd_helper_log_set
void mtmd_helper_log_set(ggml_log_callback log_callback, void* user_data)

Set logging callback (also calls mtmd_log_set internally).

mtmd_image_tokens_get_id
const(char)* mtmd_image_tokens_get_id(const(mtmd_image_tokens)* image_tokens)
Optional string ID of the image these tokens came from (see mtmd_bitmap_set_id).
mtmd_image_tokens_get_n_pos
llama_pos mtmd_image_tokens_get_n_pos(const(mtmd_image_tokens)* image_tokens)
Number of positions the image tokens occupy (may differ from the token count for M-RoPE).
mtmd_image_tokens_get_n_tokens
size_t mtmd_image_tokens_get_n_tokens(const(mtmd_image_tokens)* image_tokens)
Number of embedding tokens in the image.
mtmd_image_tokens_get_nx
size_t mtmd_image_tokens_get_nx(const(mtmd_image_tokens)* image_tokens)
Number of tokens along the x direction of the image token grid.
mtmd_image_tokens_get_ny
size_t mtmd_image_tokens_get_ny(const(mtmd_image_tokens)* image_tokens)
Number of tokens along the y direction of the image token grid.
mtmd_init_from_file
mtmd_context* mtmd_init_from_file(const(char)* mmproj_fname, const(llama_model)* text_model, mtmd_context_params ctx_params)

Initialises a multimodal context from a projector GGUF file. Returns null on failure (bad path, incompatible model, etc.).

mtmd_input_chunk_copy
mtmd_input_chunk* mtmd_input_chunk_copy(const(mtmd_input_chunk)* chunk)
Returns an owned copy of a chunk, e.g. for custom KV-cache management. Release it with mtmd_input_chunk_free.
mtmd_input_chunk_free
void mtmd_input_chunk_free(mtmd_input_chunk* chunk)
Frees a chunk obtained from mtmd_input_chunk_copy.
mtmd_input_chunk_get_id
const(char)* mtmd_input_chunk_get_id(const(mtmd_input_chunk)* chunk)
ID of the media in the chunk; null for text chunks.
mtmd_input_chunk_get_n_pos
llama_pos mtmd_input_chunk_get_n_pos(const(mtmd_input_chunk)* chunk)
Number of positions the chunk occupies (may differ from the token count for M-RoPE).
mtmd_input_chunk_get_n_tokens
size_t mtmd_input_chunk_get_n_tokens(const(mtmd_input_chunk)* chunk)
Number of tokens in the chunk.
mtmd_input_chunk_get_tokens_image
const(mtmd_image_tokens)* mtmd_input_chunk_get_tokens_image(const(mtmd_input_chunk)* chunk)
Image tokens of an image chunk; null for other chunk types.
mtmd_input_chunk_get_tokens_text
const(llama_token)* mtmd_input_chunk_get_tokens_text(const(mtmd_input_chunk)* chunk, size_t* n_tokens_output)
Token array of a text chunk; the token count is written to n_tokens_output. Returns null if the chunk is not text.
mtmd_input_chunk_get_type
mtmd_input_chunk_type mtmd_input_chunk_get_type(const(mtmd_input_chunk)* chunk)
Type of the chunk, as an mtmd_input_chunk_type value.
mtmd_input_chunks_free
void mtmd_input_chunks_free(mtmd_input_chunks* chunks)
Frees a chunk list created by mtmd_input_chunks_init, including the chunks it owns.
mtmd_input_chunks_get
const(mtmd_input_chunk)* mtmd_input_chunks_get(const(mtmd_input_chunks)* chunks, size_t idx)
Chunk at index idx. The list retains ownership of the returned pointer.
mtmd_input_chunks_init
mtmd_input_chunks* mtmd_input_chunks_init()
Allocates an empty chunk list to be filled by mtmd_tokenize. Release it with mtmd_input_chunks_free.
mtmd_input_chunks_size
size_t mtmd_input_chunks_size(const(mtmd_input_chunks)* chunks)
Number of chunks in the list.
mtmd_log_set
void mtmd_log_set(ggml_log_callback log_callback, void* user_data)

Set a logging callback.

mtmd_support_audio
bool mtmd_support_audio(mtmd_context* ctx)

True if the model supports audio input.

mtmd_support_vision
bool mtmd_support_vision(mtmd_context* ctx)

True if the model supports image input.

mtmd_tokenize
int mtmd_tokenize(mtmd_context* ctx, mtmd_input_chunks* output, const(mtmd_input_text)* text, const(mtmd_bitmap*)* bitmaps, size_t n_bitmaps)

Tokenise a text prompt that may contain media markers. Returns 0 on success, 1 on bitmap-count mismatch, 2 on preprocessing error.
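The return codes listed above can be turned into readable messages with a small helper. This is only a sketch: describeTokenizeResult is a hypothetical name, and any code other than 0, 1 or 2 is treated as unknown because the description does not define further values.

```d
// Hypothetical helper: maps the documented mtmd_tokenize return codes to text.
string describeTokenizeResult(int rc)
{
    switch (rc)
    {
        case 0:  return "ok";
        case 1:  return "bitmap count does not match the markers in the prompt";
        case 2:  return "image or audio preprocessing failed";
        default: return "unknown error";
    }
}

// With the binding linked in, it could be used like this:
//     int rc = mtmd_tokenize(ctx, chunks, &text, bitmaps.ptr, bitmaps.length);
//     if (rc != 0) throw new Exception(describeTokenizeResult(rc));
```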

Structs

InputChunks
struct InputChunks

A list of tokenized input chunks produced by MtmdContext.tokenize. Supports foreach iteration over const(mtmd_input_chunk)* elements.

MtmdBitmap
struct MtmdBitmap

An image or audio bitmap loaded from a file or raw buffer. Construct via MtmdBitmap.fromRGB, MtmdBitmap.fromAudio, or MtmdContext.loadBitmap.

MtmdContext
struct MtmdContext

A multimodal projector context loaded from a GGUF file. Encodes images and audio into embeddings for the paired language model. Test the result with if (ctx) after construction to confirm that loading succeeded.

mtmd_bitmap
struct mtmd_bitmap
Opaque C handle to an image or audio bitmap; see the mtmd_bitmap_* functions.
mtmd_context
struct mtmd_context
Opaque C handle to a multimodal context; create it with mtmd_init_from_file and release it with mtmd_free.
mtmd_context_params
struct mtmd_context_params

Parameters for mtmd_init_from_file. Always initialise via mtmd_context_params_default() and then override fields.

mtmd_image_tokens
struct mtmd_image_tokens
Opaque C handle to the tokens of one image; see the mtmd_image_tokens_* functions.
mtmd_input_chunk
struct mtmd_input_chunk
Opaque C handle to a single tokenized chunk; see the mtmd_input_chunk_* functions.
mtmd_input_chunks
struct mtmd_input_chunks
Opaque C handle to a list of chunks; see the mtmd_input_chunks_* functions.
mtmd_input_text
struct mtmd_input_text

Text input descriptor passed to mtmd_tokenize.

Meta