.Dd January 1, 2024
.Dt LLAMAFILE 1
.Os Mozilla Ocho
.Sh NAME
.Nm llamafile
.Nd large language model runner
.Sh SYNOPSIS
.Nm
.Op Fl Fl server
.Op flags...
.Fl m Ar model.gguf
.Op Fl Fl mmproj Ar vision.gguf
.Nm
.Op Fl Fl cli
.Op flags...
.Fl m Ar model.gguf
.Fl p Ar prompt
.Nm
.Op Fl Fl cli
.Op flags...
.Fl m Ar model.gguf
.Fl Fl mmproj Ar vision.gguf
.Fl Fl image Ar graphic.png
.Fl p Ar prompt
.Sh DESCRIPTION
.Nm
is a large language model tool. It has use cases such as:
.Pp
.Bl -dash -compact
.It
Code completion
.It
Prose composition
.It
Chatbot that passes the Turing test
.It
Text/image summarization and analysis
.El
.Sh OPTIONS
The following options are available:
.Bl -tag -width indent
.It Fl Fl version
Print version and exit.
.It Fl h , Fl Fl help
Show help message and exit.
.It Fl Fl cli
Puts program in command line interface mode. This flag is implied when a
prompt is supplied using either the
.Fl p
or
.Fl f
flags.
.It Fl Fl server
Puts program in server mode. This will launch an HTTP server on a local
port. This server has both a web UI and an OpenAI API compatible
completions endpoint. When the server is run on a desk system, a tab
browser tab will be launched automatically that displays the web UI.
This
.Fl Fl server
flag is implied if no prompt is specified, i.e. neither the
.Fl p
or
.Fl f
flags are passed.
.It Fl m Ar FNAME , Fl Fl model Ar FNAME
Model path in the GGUF file format.
.Pp
Default:
.Pa models/7B/ggml-model-f16.gguf
.It Fl Fl mmproj Ar FNAME
Specifies path of the LLaVA vision model in the GGUF file format. If
this flag is supplied, then the
.Fl Fl model
and
.Fl Fl image
flags should also be supplied.
.It Fl s Ar SEED , Fl Fl seed Ar SEED
Random Number Generator (RNG) seed. A random seed is used if this is
less than zero.
.Pp
Default: -1
.It Fl t Ar N , Fl Fl threads Ar N
Number of threads to use during generation.
.Pp
Default: $(nproc)/2
.It Fl tb Ar N , Fl Fl threads-batch Ar N
Set the number of threads to use during batch and prompt processing. In
some systems, it is beneficial to use a higher number of threads during
batch processing than during generation. If not specified, the number of
threads used for batch processing will be the same as the number of
threads used for generation.
.Pp
Default: Same as
.Fl Fl threads
.It Fl td Ar N , Fl Fl threads-draft Ar N
Number of threads to use during generation.
.Pp
Default: Same as
.Fl Fl threads
.It Fl tbd Ar N , Fl Fl threads-batch-draft Ar N
Number of threads to use during batch and prompt processing.
.Pp
Default: Same as
.Fl Fl threads-draft
.It Fl Fl in-prefix-bos
Prefix BOS to user inputs, preceding the
.Fl Fl in-prefix
string.
.It Fl Fl in-prefix Ar STRING
This flag is used to add a prefix to your input, primarily, this is used
to insert a space after the reverse prompt. Here's an example of how to
use the
.Fl Fl in-prefix
flag in conjunction with the
.Fl Fl reverse-prompt
flag:
.Pp
.Dl "./main -r \[dq]User:\[dq] --in-prefix \[dq] \[dq]"
.Pp
Default: empty
.It Fl Fl in-suffix Ar STRING
This flag is used to add a suffix after your input. This is useful for
adding an \[dq]Assistant:\[dq] prompt after the user's input. It's added
after the new-line character (\[rs]n) that's automatically added to the
end of the user's input. Here's an example of how to use the
.Fl Fl in-suffix
flag in conjunction with the
.Fl Fl reverse-prompt
flag:
.Pp
.Dl "./main -r \[dq]User:\[dq] --in-prefix \[dq] \[dq] --in-suffix \[dq]Assistant:\[dq]"
.Pp
Default: empty
.It Fl n Ar N , Fl Fl n-predict Ar N
Number of tokens to predict.
.Pp
.Bl -dash -compact
.It
-1 = infinity
.It
-2 = until context filled
.El
.Pp
Default: -1
.It Fl c Ar N , Fl Fl ctx-size Ar N
Set the size of the prompt context. A larger context size helps the
model to better comprehend and generate responses for longer input or
conversations. The LLaMA models were built with a context of 2048, which
yields the best results on longer input / inference.
.Pp
.Bl -dash -compact
.It
0 = loaded automatically from model
.El
.Pp
Default: 512
.It Fl b Ar N , Fl Fl batch-size Ar N
Batch size for prompt processing.
.Pp
Default: 512
.It Fl Fl top-k Ar N
Top-k sampling.
.Pp
.Bl -dash -compact
.It
0 = disabled
.El
.Pp
Default: 40
.It Fl Fl top-p Ar N
Top-p sampling.
.Pp
.Bl -dash -compact
.It
1.0 = disabled
.El
.Pp
Default: 0.9
.It Fl Fl min-p Ar N
Min-p sampling.
.Pp
.Bl -dash -compact
.It
0.0 = disabled
.El
.Pp
Default: 0.1
.It Fl Fl tfs Ar N
Tail free sampling, parameter z.
.Pp
.Bl -dash -compact
.It
1.0 = disabled
.El
.Pp
Default: 1.0
.It Fl Fl typical Ar N
Locally typical sampling, parameter p.
.Pp
.Bl -dash -compact
.It
1.0 = disabled
.El
.Pp
Default: 1.0
.It Fl Fl repeat-last-n Ar N
Last n tokens to consider for penalize.
.Pp
.Bl -dash -compact
.It
0 = disabled
.It
-1 = ctx_size
.El
.Pp
Default: 64
.It Fl Fl repeat-penalty Ar N
Penalize repeat sequence of tokens.
.Pp
.Bl -dash -compact
.It
1.0 = disabled
.El
.Pp
Default: 1.1
.It Fl Fl presence-penalty Ar N
Repeat alpha presence penalty.
.Pp
.Bl -dash -compact
.It
0.0 = disabled
.El
.Pp
Default: 0.0
.It Fl Fl frequency-penalty Ar N
Repeat alpha frequency penalty.
.Pp
.Bl -dash -compact
.It
0.0 = disabled
.El
.Pp
Default: 0.0
.It Fl Fl mirostat Ar N
Use Mirostat sampling. Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used..
.Pp
.Bl -dash -compact
.It
0 = disabled
.It
1 = Mirostat
.It
2 = Mirostat 2.0
.El
.Pp
Default: 0
.It Fl Fl mirostat-lr Ar N
Mirostat learning rate, parameter eta.
.Pp
Default: 0.1
.It Fl Fl mirostat-ent Ar N
Mirostat target entropy, parameter tau.
.Pp
Default: 5.0
.It Fl l Ar TOKEN_ID(+/-)BIAS , Fl Fl logit-bias Ar TOKEN_ID(+/-)BIAS
Modifies the likelihood of token appearing in the completion, i.e.
.Fl Fl logit-bias Ar 15043+1
to increase likelihood of token
.Ar ' Hello' ,
or
.Fl Fl logit-bias Ar 15043-1
to decrease likelihood of token
.Ar ' Hello' .
.It Fl md Ar FNAME , Fl Fl model-draft Ar FNAME
Draft model for speculative decoding.
.Pp
Default:
.Pa models/7B/ggml-model-f16.gguf
.It Fl Fl cfg-negative-prompt Ar PROMPT
Negative prompt to use for guidance..
.Pp
Default: empty
.It Fl Fl cfg-negative-prompt-file Ar FNAME
Negative prompt file to use for guidance.
.Pp
Default: empty
.It Fl Fl cfg-scale Ar N
Strength of guidance.
.Pp
.Bl -dash -compact
.It
1.0 = disable
.El
.Pp
Default: 1.0
.It Fl Fl rope-scaling Ar {none,linear,yarn}
RoPE frequency scaling method, defaults to linear unless specified by the model
.It Fl Fl rope-scale Ar N
RoPE context scaling factor, expands context by a factor of
.Ar N
where
.Ar N
is the linear scaling factor used by the fine-tuned model. Some
fine-tuned models have extended the context length by scaling RoPE. For
example, if the original pre-trained model have a context length (max
sequence length) of 4096 (4k) and the fine-tuned model have 32k. That is
a scaling factor of 8, and should work by setting the above
.Fl Fl ctx-size
to 32768 (32k) and
.Fl Fl rope-scale
to 8.
.It Fl Fl rope-freq-base Ar N
RoPE base frequency, used by NTK-aware scaling.
.Pp
Default: loaded from model
.It Fl Fl rope-freq-scale Ar N
RoPE frequency scaling factor, expands context by a factor of 1/N
.It Fl Fl yarn-orig-ctx Ar N
YaRN: original context size of model.
.Pp
Default: 0 = model training context size
.It Fl Fl yarn-ext-factor Ar N
YaRN: extrapolation mix factor.
.Pp
.Bl -dash -compact
.It
0.0 = full interpolation
.El
.Pp
Default: 1.0
.It Fl Fl yarn-attn-factor Ar N
YaRN: scale sqrt(t) or attention magnitude.
.Pp
Default: 1.0
.It Fl Fl yarn-beta-slow Ar N
YaRN: high correction dim or alpha.
.Pp
Default: 1.0
.It Fl Fl yarn-beta-fast Ar N
YaRN: low correction dim or beta.
.Pp
Default: 32.0
.It Fl Fl ignore-eos
Ignore end of stream token and continue generating (implies
.Fl Fl logit-bias Ar 2-inf )
.It Fl Fl no-penalize-nl
Do not penalize newline token.
.It Fl Fl temp Ar N
Temperature.
.Pp
Default: 0.8
.It Fl Fl logits-all
Return logits for all tokens in the batch.
.Pp
Default: disabled
.It Fl Fl hellaswag
Compute HellaSwag score over random tasks from datafile supplied with -f
.It Fl Fl hellaswag-tasks Ar N
Number of tasks to use when computing the HellaSwag score.
.Pp
Default: 400
.It Fl Fl keep Ar N
This flag allows users to retain the original prompt when the model runs
out of context, ensuring a connection to the initial instruction or
conversation topic is maintained, where
.Ar N
is the number of tokens from the initial prompt to retain when the model
resets its internal context.
.Pp
.Bl -dash -compact
.It
0 = no tokens are kept from initial prompt
.It
-1 = retain all tokens from initial prompt
.El
.Pp
Default: 0
.It Fl Fl draft Ar N
Number of tokens to draft for speculative decoding.
.Pp
Default: 16
.It Fl Fl chunks Ar N
Max number of chunks to process.
.Pp
.Bl -dash -compact
.It
-1 = all
.El
.Pp
Default: -1
.It Fl ns Ar N , Fl Fl sequences Ar N
Number of sequences to decode.
.Pp
Default: 1
.It Fl pa Ar N , Fl Fl p-accept Ar N
speculative decoding accept probability.
.Pp
Default: 0.5
.It Fl ps Ar N , Fl Fl p-split Ar N
Speculative decoding split probability.
.Pp
Default: 0.1
.It Fl Fl mlock
Force system to keep model in RAM rather than swapping or compressing.
.It Fl Fl no-mmap
Do not memory-map model (slower load but may reduce pageouts if not using mlock).
.It Fl Fl numa
Attempt optimizations that help on some NUMA systems if run without this previously, it is recommended to drop the system page cache before using this. See https://github.com/ggerganov/llama.cpp/issues/1437.
.It Fl Fl recompile
Force GPU support to be recompiled at runtime if possible.
.It Fl Fl nocompile
Never compile GPU support at runtime.
.Pp
If the appropriate DSO file already exists under
.Pa ~/.llamafile/
then it'll be linked as-is without question. If a prebuilt DSO is
present in the PKZIP content of the executable, then it'll be extracted
and linked if possible. Otherwise,
.Nm
will skip any attempt to compile GPU support and simply fall back to
using CPU inference.
.It Fl Fl gpu Ar GPU
Specifies which brand of GPU should be used. Valid choices are:
.Pp
.Bl -dash
.It
.Ar AUTO :
Use any GPU if possible, otherwise fall back to CPU inference (default)
.It
.Ar APPLE :
Use Apple Metal GPU. This is only available on MacOS ARM64. If Metal
could not be used for any reason, then a fatal error will be raised.
.It
.Ar AMD :
Use AMD GPUs. The AMD HIP ROCm SDK should be installed in which case we
assume the HIP_PATH environment variable has been defined. The set of
gfx microarchitectures needed to run on the host machine is determined
automatically based on the output of the hipInfo command. On Windows,
.Nm
release binaries are distributed with a tinyBLAS DLL so it'll work out
of the box without requiring the HIP SDK to be installed. However,
tinyBLAS is slower than rocBLAS for batch and image processing, so it's
recommended that the SDK be installed anyway. If an AMD GPU could not be
used for any reason, then a fatal error will be raised.
.It
.Ar NVIDIA :
Use NVIDIA GPUs. If an NVIDIA GPU could not be used for any reason, a
fatal error will be raised. On Windows, NVIDIA GPU support will use our
tinyBLAS library, since it works on stock Windows installs. However,
tinyBLAS goes slower for batch and image processing. It's possible to
use NVIDIA's closed-source cuBLAS library instead. To do that, both MSVC
and CUDA need to be installed and the
.Nm
command should be run once from the x64 MSVC command prompt with the
.Fl Fl recompile
flag passed. The GGML library will then be compiled and saved to
.Pa ~/.llamafile/
so the special process only needs to happen a single time.
.It
.Ar DISABLE :
Never use GPU and instead use CPU inference. This setting is implied by
.Fl ngl Ar 0 .
.El
.Pp
.It Fl ngl Ar N , Fl Fl n-gpu-layers Ar N
Number of layers to store in VRAM.
.It Fl ngld Ar N , Fl Fl n-gpu-layers-draft Ar N
Number of layers to store in VRAM for the draft model.
.It Fl sm Ar SPLIT_MODE , Fl Fl split-mode Ar SPLIT_MODE
How to split the model across multiple GPUs, one of:
.Bl -dash -compact
.It
none: use one GPU only
.It
layer (default): split layers and KV across GPUs
.It
row: split rows across GPUs
.El
.It Fl ts Ar SPLIT , Fl Fl tensor-split Ar SPLIT
When using multiple GPUs this option controls how large tensors should
be split across all GPUs.
.Ar SPLIT
is a comma-separated list of non-negative values that assigns the
proportion of data that each GPU should get in order. For example,
\[dq]3,2\[dq] will assign 60% of the data to GPU 0 and 40% to GPU 1. By
default the data is split in proportion to VRAM but this may not be
optimal for performance. Requires cuBLAS.
How to split tensors across multiple GPUs, comma-separated list of
proportions, e.g. 3,1
.It Fl mg Ar i , Fl Fl main-gpu Ar i
The GPU to use for scratch and small tensors.
.It Fl nommq , Fl Fl no-mul-mat-q
Use cuBLAS instead of custom mul_mat_q CUDA kernels. Not recommended since this is both slower and uses more VRAM.
.It Fl Fl verbose-prompt
Print prompt before generation.
.It Fl Fl simple-io
Use basic IO for better compatibility in subprocesses and limited consoles.
.It Fl Fl lora Ar FNAME
Apply LoRA adapter (implies
.Fl Fl no-mmap )
.It Fl Fl lora-scaled Ar FNAME Ar S
Apply LoRA adapter with user defined scaling S (implies
.Fl Fl no-mmap )
.It Fl Fl lora-base Ar FNAME
Optional model to use as a base for the layers modified by the LoRA adapter
.It Fl Fl unsecure
Disables pledge() sandboxing on Linux and OpenBSD.
.It Fl Fl samplers
Samplers that will be used for generation in the order, separated by
semicolon, for example: top_k;tfs;typical;top_p;min_p;temp
.It Fl Fl samplers-seq
Simplified sequence for samplers that will be used.
.It Fl cml , Fl Fl chatml
Run in chatml mode (use with ChatML-compatible models)
.It Fl dkvc , Fl Fl dump-kv-cache
Verbose print of the KV cache.
.It Fl nkvo , Fl Fl no-kv-offload
Disable KV offload.
.It Fl ctk Ar TYPE , Fl Fl cache-type-k Ar TYPE
KV cache data type for K.
.It Fl ctv Ar TYPE , Fl Fl cache-type-v Ar TYPE
KV cache data type for V.
.It Fl gan Ar N , Fl Fl grp-attn-n Ar N
Group-attention factor.
.Pp
Default: 1
.It Fl gaw Ar N , Fl Fl grp-attn-w Ar N
Group-attention width.
.Pp
Default: 512
.It Fl bf Ar FNAME , Fl Fl binary-file Ar FNAME
Binary file containing multiple choice tasks.
.It Fl Fl winogrande
Compute Winogrande score over random tasks from datafile supplied by the
.Fl f
flag.
.It Fl Fl winogrande-tasks Ar N
Number of tasks to use when computing the Winogrande score.
.Pp
Default: 0
.It Fl Fl multiple-choice
Compute multiple choice score over random tasks from datafile supplied
by the
.Fl f
flag.
.It Fl Fl multiple-choice-tasks Ar N
Number of tasks to use when computing the multiple choice score.
.Pp
Default: 0
.It Fl Fl kl-divergence
Computes KL-divergence to logits provided via the
.Fl Fl kl-divergence-base
flag.
.It Fl Fl save-all-logits Ar FNAME , Fl Fl kl-divergence-base Ar FNAME
Save logits to filename.
.It Fl ptc Ar N , Fl Fl print-token-count Ar N
Print token count every
.Ar N
tokens.
.Pp
Default: -1
.It Fl Fl pooling Ar KIND
Specifies pooling type for embeddings. This may be one of:
.Pp
.Bl -dash -compact
.It
none
.It
mean
.It
cls
.El
.Pp
The model default is used if unspecified.
.El
.Sh CLI OPTIONS
The following options may be specified when
.Nm
is running in
.Fl Fl cli
mode.
.Bl -tag -width indent
.It Fl e , Fl Fl escape
Process prompt escapes sequences (\[rs]n, \[rs]r, \[rs]t, \[rs]\[aa], \[rs]\[dq], \[rs]\[rs])
.It Fl p Ar STRING , Fl Fl prompt Ar STRING
Prompt to start text generation. Your LLM works by auto-completing this
text. For example:
.Pp
.Dl "llamafile -m model.gguf -p \[dq]four score and\[dq]"
.Pp
Stands a pretty good chance of printing Lincoln's Gettysburg Address.
Prompts can take on a structured format too. Depending on how your model
was trained, it may specify in its docs an instruction notation. With
some models that might be:
.Pp
.Dl "llamafile -p \[dq][INST]Summarize this: $(cat file)[/INST]\[dq]"
.Pp
In most cases, simply colons and newlines will work too:
.Pp
.Dl "llamafile -e -p \[dq]User: What is best in life?\[rs]nAssistant:\[dq]"
.Pp
.It Fl f Ar FNAME , Fl Fl file Ar FNAME
Prompt file to start generation.
.It Fl Fl grammar Ar GRAMMAR
BNF-like grammar to constrain which tokens may be selected when
generating text. For example, the grammar:
.Pp
.Dl "root ::= \[dq]yes\[dq] | \[dq]no\[dq]"
.Pp
will force the LLM to only output yes or no before exiting. This is
useful for shell scripts when the
.Fl Fl no-display-prompt
flag is also supplied.
.It Fl Fl grammar-file Ar FNAME
File to read grammar from.
.It Fl Fl fast
Put llamafile into fast math mode. This disables algorithms that reduce
floating point rounding, e.g. Kahan summation, and certain functions
like expf() will be vectorized but handle underflows less gracefully.
It's unspecified whether llamafile runs in fast or precise math mode
when neither flag is specified.
.It Fl Fl precise
Put llamafile into precise math mode. This enables algorithms that
reduce floating point rounding, e.g. Kahan summation, and certain
functions like expf() will always handle subnormals correctly. It's
unspecified whether llamafile runs in fast or precise math mode when
neither flag is specified.
.It Fl Fl trap
Put llamafile into math trapping mode. When floating point exceptions
occur, such as NaNs, overflow, and divide by zero, llamafile will print
a warning to the console. This warning will include a C++ backtrace the
first time an exception is trapped. The op graph will also be dumped to
a file, and llamafile will report the specific op where the exception
occurred. This is useful for troubleshooting when reporting issues.
USing this feature will disable sandboxing. Math trapping is only
possible if your CPU supports it. That is generally the case on AMD64,
however it's less common on ARM64.
.It Fl Fl prompt-cache Ar FNAME
File to cache prompt state for faster startup.
.Pp
Default: none
.It Fl fa Ar FNAME , Fl Fl flash-attn
Enable Flash Attention. This is a mathematical shortcut that can speed
up inference for certain models. This feature is still under active
development.
.It Fl Fl prompt-cache-all
If specified, saves user input and generations to cache as well. Not supported with
.Fl Fl interactive
or other interactive options.
.It Fl Fl prompt-cache-ro
If specified, uses the prompt cache but does not update it.
.It Fl Fl random-prompt
Start with a randomized prompt.
.It Fl Fl image Ar IMAGE_FILE
Path to an image file. This should be used with multimodal models.
Alternatively, it's possible to embed an image directly into the prompt
instead; in which case, it must be base64 encoded into an HTML img tag
URL with the image/jpeg MIME type. See also the
.Fl Fl mmproj
flag for supplying the vision model.
.It Fl i , Fl Fl interactive
Run the program in interactive mode, allowing users to engage in
real-time conversations or provide specific instructions to the model.
.It Fl Fl interactive-first
Run the program in interactive mode and immediately wait for user input
before starting the text generation.
.It Fl ins , Fl Fl instruct
Run the program in instruction mode, which is specifically designed to
work with Alpaca models that excel in completing tasks based on user
instructions.
.Pp
Technical details: The user's input is internally prefixed with the
reverse prompt (or \[dq]### Instruction:\[dq] as the default), and
followed by \[dq]### Response:\[dq] (except if you just press Return
without any input, to keep generating a longer response).
.Pp
By understanding and utilizing these interaction options, you can create
engaging and dynamic experiences with the LLaMA models, tailoring the
text generation process to your specific needs.
.It Fl r Ar PROMPT , Fl Fl reverse-prompt Ar PROMPT
Specify one or multiple reverse prompts to pause text generation and
switch to interactive mode. For example,
.Fl r Ar \[dq]User:\[dq]
can be used to jump back into the conversation whenever it's the user's
turn to speak. This helps create a more interactive and conversational
experience. However, the reverse prompt doesn't work when it ends with a
space. To overcome this limitation, you can use the
.Fl Fl in-prefix
flag to add a space or any other characters after the reverse prompt.
.It Fl Fl color
Enable colorized output to differentiate visually distinguishing between
prompts, user input, and generated text.
.It Fl Fl no-display-prompt , Fl Fl silent-prompt
Don't echo the prompt itself to standard output.
.It Fl Fl keep Ar N
Specifies number of tokens to keep from the initial prompt. The default
is -1 which means all tokens.
.It Fl Fl multiline-input
Allows you to write or paste multiple lines without ending each in '\[rs]'.
.It Fl Fl cont-batching
Enables continuous batching, a.k.a. dynamic batching.
is -1 which means all tokens.
.It Fl Fl embedding
In CLI mode, the embedding flag may be use to print embeddings to
standard output. By default, embeddings are computed over a whole
prompt. However the
.Fl Fl multiline
flag may be passed, to have a separate embeddings array computed for
each line of text in the prompt. In multiline mode, each embedding array
will be printed on its own line to standard output, where individual
floats are separated by space. If both the
.Fl Fl multiline-input
and
.Fl Fl interactive
flags are passed, then a pretty-printed summary of embeddings along with
a cosine similarity matrix will be printed to the terminal.
.El
.Sh SERVER OPTIONS
The following options may be specified when
.Nm
is running in
.Fl Fl server
mode.
.Bl -tag -width indent
.It Fl Fl port Ar PORT
Port to listen
.Pp
Default: 8080
.It Fl Fl host Ar IPADDR
IP address to listen.
.Pp
Default: 127.0.0.1
.It Fl to Ar N , Fl Fl timeout Ar N
Server read/write timeout in seconds.
.Pp
Default: 600
.It Fl np Ar N , Fl Fl parallel Ar N
Number of slots for process requests.
.Pp
Default: 1
.It Fl cb , Fl Fl cont-batching
Enable continuous batching (a.k.a dynamic batching).
.Pp
Default: disabled
.It Fl spf Ar FNAME , Fl Fl system-prompt-file Ar FNAME
Set a file to load a system prompt (initial prompt of all slots), this
is useful for chat applications.
.It Fl a Ar ALIAS , Fl Fl alias Ar ALIAS
Set an alias for the model. This will be added as the
.Ar model
field in completion responses.
.It Fl Fl path Ar PUBLIC_PATH
Path from which to serve static files.
.Pp
Default:
.Pa /zip/llama.cpp/server/public
.It Fl Fl nobrowser
Do not attempt to open a web browser tab at startup.
.It Fl gan Ar N , Fl Fl grp-attn-n Ar N
Set the group attention factor to extend context size through
self-extend. The default value is
.Ar 1
which means disabled. This flag is used together with
.Fl Fl grp-attn-w .
.It Fl gaw Ar N , Fl Fl grp-attn-w Ar N
Set the group attention width to extend context size through
self-extend. The default value is
.Ar 512 .
This flag is used together with
.Fl Fl grp-attn-n .
.El
.Sh LOG OPTIONS
The following log options are available:
.Bl -tag -width indent
.It Fl ld Ar LOGDIR , Fl Fl logdir Ar LOGDIR
Path under which to save YAML logs (no logging if unset)
.It Fl Fl log-test
Run simple logging test
.It Fl Fl log-disable
Disable trace logs
.It Fl Fl log-enable
Enable trace logs
.It Fl Fl log-file
Specify a log filename (without extension)
.It Fl Fl log-new
Create a separate new log file on start. Each log file will have unique name:
.Fa <name>.<ID>.log
.It Fl Fl log-append
Don't truncate the old log file.
.El
.Sh EXAMPLES
Here's an example of how to run llama.cpp's built-in HTTP server. This
example uses LLaVA v1.5-7B, a multimodal LLM that works with llama.cpp's
recently-added support for image inputs.
.Bd -literal -offset indent
llamafile \[rs]
  -m llava-v1.5-7b-Q8_0.gguf \[rs]
  --mmproj llava-v1.5-7b-mmproj-Q8_0.gguf \[rs]
  --host 0.0.0.0
.Ed
.Pp
Here's an example of how to generate code for a libc function using the
llama.cpp command line interface, utilizing WizardCoder-Python-13B
weights:
.Bd -literal -offset indent
llamafile \[rs]
  -m wizardcoder-python-13b-v1.0.Q8_0.gguf --temp 0 -r '}\[rs]n' -r '\`\`\`\[rs]n' \[rs]
  -e -p '\`\`\`c\[rs]nvoid *memcpy(void *dst, const void *src, size_t size) {\[rs]n'
.Ed
.Pp
Here's a similar example that instead utilizes Mistral-7B-Instruct
weights for prose composition:
.Bd -literal -offset indent
llamafile \[rs]
  -m mistral-7b-instruct-v0.2.Q5_K_M.gguf \[rs]
  -p '[INST]Write a story about llamas[/INST]'
.Ed
.Pp
Here's an example of how llamafile can be used as an interactive chatbot
that lets you query knowledge contained in training data:
.Bd -literal -offset indent
llamafile -m llama-65b-Q5_K.gguf -p '
The following is a conversation between a Researcher and their helpful AI
assistant Digital Athena which is a large language model trained on the
sum of human knowledge.
Researcher: Good morning.
Digital Athena: How can I help you today?
Researcher:' --interactive --color --batch_size 1024 --ctx_size 4096 \[rs]
--keep -1 --temp 0 --mirostat 2 --in-prefix ' ' --interactive-first \[rs]
--in-suffix 'Digital Athena:' --reverse-prompt 'Researcher:'
.Ed
.Pp
Here's an example of how you can use llamafile to summarize HTML URLs:
.Bd -literal -offset indent
(
  echo '[INST]Summarize the following text:'
  links -codepage utf-8 \[rs]
        -force-html \[rs]
        -width 500 \[rs]
        -dump https://www.poetryfoundation.org/poems/48860/the-raven |
    sed 's/   */ /g'
  echo '[/INST]'
) | llamafile \[rs]
      -m mistral-7b-instruct-v0.2.Q5_K_M.gguf \[rs]
      -f /dev/stdin \[rs]
      -c 0 \[rs]
      --temp 0 \[rs]
      -n 500 \[rs]
      --no-display-prompt 2>/dev/null
.Ed
.Pp
Here's how you can use llamafile to describe a jpg/png/gif/bmp image:
.Bd -literal -offset indent
llamafile --temp 0 \[rs]
  --image lemurs.jpg \[rs]
  -m llava-v1.5-7b-Q4_K.gguf \[rs]
  --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \[rs]
  -e -p '### User: What do you see?\[rs]n### Assistant: ' \[rs]
  --no-display-prompt 2>/dev/null
.Ed
.Pp
If you wanted to write a script to rename all your image files, you
could use the following command to generate a safe filename:
.Bd -literal -offset indent
llamafile --temp 0 \[rs]
    --image ~/Pictures/lemurs.jpg \[rs]
    -m llava-v1.5-7b-Q4_K.gguf \[rs]
    --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \[rs]
    --grammar 'root ::= [a-z]+ (" " [a-z]+)+' \[rs]
    -e -p '### User: The image has...\[rs]n### Assistant: ' \[rs]
    --no-display-prompt 2>/dev/null |
  sed -e's/ /_/g' -e's/$/.jpg/'
three_baby_lemurs_on_the_back_of_an_adult_lemur.jpg
.Ed
.Pp
Here's an example of how to make an API request to the OpenAI API
compatible completions endpoint when your
.Nm
is running in the background in
.Fl Fl server
mode.
.Bd -literal -offset indent
curl -s http://localhost:8080/v1/chat/completions \[rs]
     -H "Content-Type: application/json" -d '{
  "model": "gpt-3.5-turbo",
  "stream": true,
  "messages": [
    {
      "role": "system",
      "content": "You are a poetic assistant."
    },
    {
      "role": "user",
      "content": "Compose a poem that explains FORTRAN."
    }
  ]
}' | python3 -c '
import json
import sys
json.dump(json.load(sys.stdin), sys.stdout, indent=2)
print()
'
.Ed
.Sh PROTIP
The
.Fl ngl Ar 35
flag needs to be passed in order to use GPUs made by NVIDIA and AMD.
It's not enabled by default since it sometimes needs to be tuned based
on the system hardware and model architecture, in order to achieve
optimal performance, and avoid compromising a shared display.
.Sh SEE ALSO
.Xr llamafile-quantize 1 ,
.Xr llamafile-perplexity 1 ,
.Xr llava-quantize 1 ,
.Xr zipalign 1 ,
.Xr unzip 1