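Decodes a token batch. Returns 0 on success and a nonzero code otherwise (in llama.cpp, positive values are warnings such as no free KV slot for the batch; negative values are errors).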
Encodes a batch (encoder-decoder models); returns 0 on success.
All output embeddings, packed contiguously in batch order; shape is [n_outputs * nEmbd]. Valid after decode. Available when pooling is LLAMA_POOLING_TYPE_NONE or the model is generative; returns null otherwise.
Embeddings for the ith output token; negative indices count from the end (-1 = last). Shape is [nEmbd]. Returns null for an invalid index.
Pooled embeddings for a sequence. Returns null when pooling is LLAMA_POOLING_TYPE_NONE.
Logits at output position idx; negative indices count from the end (-1 = last). The returned data is valid only until the next decode call.
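The packed layout and the "-1 = last" convention described above can be illustrated with a small, self-contained sketch. The helper names `resolve_output_index` and `embedding_row` are hypothetical and not part of the wrapper. The sketch also assumes the index addresses output rows directly, whereas the real library may first map a batch position to an output row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: resolve an output index where negative values count
// from the end (-1 = last). Returns -1 when the index is out of range.
inline long resolve_output_index(long i, std::size_t n_outputs) {
    const long n = static_cast<long>(n_outputs);
    const long j = (i < 0) ? n + i : i;
    return (j >= 0 && j < n) ? j : -1;
}

// Hypothetical helper: pointer to row i of a packed [n_outputs * n_embd]
// buffer, or nullptr for an invalid index (mirroring the "returns null"
// behavior described above).
inline const float* embedding_row(const std::vector<float>& packed,
                                  std::size_t n_embd, long i) {
    assert(n_embd > 0 && packed.size() % n_embd == 0);
    const long j = resolve_output_index(i, packed.size() / n_embd);
    if (j < 0) return nullptr;
    return packed.data() + static_cast<std::size_t>(j) * n_embd;
}
```

Each row occupies `n_embd` consecutive floats, so row `j` starts at offset `j * n_embd` in the packed buffer.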
Clear the KV cache metadata. Pass data = true to also clear the underlying data buffers.
Copy the current state into dst. Returns the number of bytes written.
Byte count of the current state. Use this to size a buffer before stateGetData.
Load state from a session file. Returns true on success, with tokensOut filled and tokenCount set to the number of tokens read.
Save the state to a session file, recording tokens as the session prompt. Returns true on success.
Copy sequence seqId's KV cache into dst. Returns bytes written.
Byte count required to snapshot sequence seqId.
Restore a KV snapshot from src into sequence destSeqId. Returns bytes consumed; 0 means failure.
Restore the state from src. Returns the number of bytes consumed.
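The state functions above follow a size-then-copy pattern: query the byte count, allocate a buffer, copy the state out, and later feed the same bytes back in. The sketch below shows that pattern against a hypothetical stand-in type, `FakeContext`. Only the name `stateGetData` appears in the text above; `stateGetSize`, `stateSetData`, and all signatures here are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the wrapper; the real signatures may differ.
struct FakeContext {
    std::vector<std::uint8_t> state;

    // Byte count of the current state.
    std::size_t stateGetSize() const { return state.size(); }

    // Copies the state into dst; returns bytes written (0 if dst is too small).
    std::size_t stateGetData(std::uint8_t* dst, std::size_t capacity) const {
        if (capacity < state.size()) return 0;
        std::copy(state.begin(), state.end(), dst);
        return state.size();
    }

    // Restores the state from src; returns bytes consumed.
    std::size_t stateSetData(const std::uint8_t* src, std::size_t size) {
        state.assign(src, src + size);
        return size;
    }
};

// Size the buffer first, then copy, then trim to what was actually written.
template <typename Ctx>
std::vector<std::uint8_t> snapshot(const Ctx& ctx) {
    std::vector<std::uint8_t> buf(ctx.stateGetSize());
    const std::size_t written = ctx.stateGetData(buf.data(), buf.size());
    buf.resize(written);
    return buf;
}

// Treats the restore as successful only if the whole buffer was consumed
// (an assumption of this sketch).
template <typename Ctx>
bool restore(Ctx& ctx, const std::vector<std::uint8_t>& buf) {
    return ctx.stateSetData(buf.data(), buf.size()) == buf.size();
}
```

The per-sequence variants have the same shape, with a sequence id added to each call.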
Raw memory handle. Use for sequence management (copy, remove, shift, etc.).
Active pooling type as an int (compare to LLAMA_POOLING_TYPE_* constants).
Create a context from explicit params.
Create a context from a window size and a batch size.
A llama_context that frees itself on destruction.
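One common way to get free-on-destruction ownership of a C handle is a `std::unique_ptr` with a custom deleter. The sketch below is not the wrapper's actual implementation. It uses a hypothetical C-style API (`fake_context`, `fake_init`, `fake_free`) in place of the real library so that it stays self-contained.

```cpp
#include <memory>

// Hypothetical C-style API standing in for the real library.
struct fake_context { int n_ctx; };

static int g_live_contexts = 0;  // counts contexts not yet freed

fake_context* fake_init(int n_ctx) {
    ++g_live_contexts;
    return new fake_context{n_ctx};
}

void fake_free(fake_context* ctx) {
    --g_live_contexts;
    delete ctx;
}

// Deleter that forwards to the C-style free function.
struct FakeContextDeleter {
    void operator()(fake_context* ctx) const { fake_free(ctx); }
};

// Owning handle: the context is freed when the handle goes out of scope.
using OwnedContext = std::unique_ptr<fake_context, FakeContextDeleter>;

inline OwnedContext make_context(int n_ctx) {
    return OwnedContext(fake_init(n_ctx));
}
```

Because `std::unique_ptr` is move-only, the handle cannot be copied by accident, which rules out double frees.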