Llama.cpp Model Deployment
The llama.cpp Model Deployment App is available under the ClearML Enterprise plan.
The llama.cpp Model Deployment app enables users to quickly deploy LLMs in GGUF format using llama.cpp.
The llama.cpp Model Deployment application serves your model on a machine of your choice. Once an app instance is
running, it serves your model through a secure, publicly accessible network endpoint.
The app supports multi-model hosting and Universal Memory technology, enabling inactive models to be offloaded to other memory options to free GPU resources:
- CPU RAM – via Automatic CPU Offloading and configurable Max CUDA Memory limits.
- Disk storage – via Disk Swapping (requires Automatic CPU Offloading to be disabled).
The app monitors endpoint activity and shuts down if the model remains inactive for longer than a specified maximum idle time.
The llama.cpp Model Deployment app makes use of the App Gateway Router, which implements a secure, authenticated network endpoint for the model.
If the ClearML AI Application Gateway is not available, the model endpoint might not be accessible. For more information, see AI Application Gateway.
After starting a llama.cpp Model Deployment instance, you can view the following information in its dashboard:
- Status indicator
- App instance is running and is actively in use
- App instance is setting up
- App instance is idle
- App instance is stopped
- Idle time - Time elapsed since last activity
- Generate Token - Link to your workspace Settings page, where you can generate a token for accessing your deployed model in the AI APPLICATION GATEWAY section
- Deployed models table:
- Model name
- Endpoint - The publicly accessible URL of the model endpoint. Active model endpoints are also listed in the Model Endpoints table, which allows you to view and compare endpoint details and monitor their status over time
- Model access command line example
- Select the model the command should access
- Prompt - Provide a prompt to send to the model.
- The curl command line to send your prompt to the selected model’s endpoint. Replace YOUR_GENERATED_TOKEN with a valid token generated in the AI APPLICATION GATEWAY section of the Settings page (a Python equivalent is sketched after this list).
- Total Number of Requests - Number of requests over time
- Tokens per Second - Number of tokens processed over time
- Latency - Request response time (ms) over time
- Endpoint resource monitoring metrics over time
- CPU usage
- Network throughput
- Disk performance
- Memory performance
- GPU utilization
- GPU memory usage
- GPU temperature
- Console log - The console log shows the app instance's console output: setup progress, status changes, error messages, etc.
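The dashboard's curl example can also be reproduced programmatically. Below is a minimal sketch in Python, assuming the deployed endpoint follows the OpenAI-compatible chat completions route exposed by the llama.cpp server; the URL, model endpoint name, and token are placeholders to be replaced with the values shown in your instance dashboard and Settings page.

```python
# Minimal sketch of querying a deployed model endpoint from Python.
# The URL, model endpoint name, and token below are placeholders -- copy the real
# values from the app instance dashboard and the AI APPLICATION GATEWAY section
# of the Settings page. The request body assumes the OpenAI-compatible chat
# completions route exposed by the llama.cpp server.
import requests

ENDPOINT_URL = "https://<gateway-host>/<model-endpoint-name>/v1/chat/completions"  # placeholder
API_TOKEN = "YOUR_GENERATED_TOKEN"  # generated in Settings > AI APPLICATION GATEWAY

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "model": "<model-endpoint-name>",  # placeholder: the deployed model's endpoint name
        "messages": [{"role": "user", "content": "Briefly explain what the GGUF format is."}],
        "max_tokens": 128,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```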
You can embed plots from the app instance dashboard into ClearML Reports. These visualizations
are updated live as the app instance(s) update. The Enterprise Plan supports embedding resources in
external tools (e.g. Notion). Hover over the plot and click the embed icon to copy the embed code, then navigate to a report and paste it.
Llama.cpp Model Deployment Instance Configuration
When configuring a new llama.cpp Model Deployment instance, you can fill in the required parameters or reuse the configuration of a previously launched instance.
Launch an app instance with the configuration of a previously launched instance using one of the following options:
- Cloning a previously launched app instance will open the instance launch form with the original instance's configuration prefilled.
- Importing an app configuration file. You can export the configuration of a previously launched instance as a JSON file when viewing its configuration.
The prefilled configuration form can be edited before launching the new app instance.
To configure a new app instance, click Launch New
to open the app's configuration form.
Configuration Options
- Import Configuration: Import an app instance configuration file. This will fill the configuration form with the values from the file, which can be modified before launching the app instance
- Instance name: Name for the Llama.cpp Model Deployment instance. This will appear in the instance list
- Service Project (Access Control): The ClearML project where the app instance is created. Access is determined by project-level permissions (i.e. users with read access can use the app instance).
- Queue: The ClearML Queue to which the
llama.cpp Model Deployment app instance task will be enqueued (make sure an agent is assigned to it)
- AI Gateway Route: Select an available, admin-preconfigured route to use as the service endpoint. If none is selected, an ephemeral endpoint will be created.
- Model Configuration: Configure the behavior and performance of the model serving engine.
- CLI: Llama.cpp CLI arguments. If set, these arguments will be passed to Llama.cpp and all following entries will be ignored, except for the Model field.
- Verbose: Enable detailed logging
- No MMAP: Disable memory-mapping of model files. May improve performance on some systems but increases memory usage.
- Continuous Batching: Enable continuous batching for processing multiple requests efficiently. Improves throughput for multiple concurrent requests.
- Embedding: Generate embeddings instead of text. Useful for semantic search and text similarity tasks.
- Model: A ClearML Model ID or a Hugging Face model. The model must be in GGUF format. If you are using a Hugging Face model, make sure to pass the path to the GGUF file. For example: provider/repo/path/to/model.gguf
- Model Endpoint Name: The name to be used for API access.
- Number of GPU Layers: Number of layers to store in VRAM instead of system RAM (CPU). 9999 loads all layers into VRAM.
- Repeat Penalty: Penalty factor applied to repeated token sequences. To disable, set to 1.0.
- Temperature: Controls randomness in text generation. Higher temperatures produce lower-probability, more "creative" outputs, while lower temperatures produce higher-probability, more predictable outputs.
- Top-K: How many of the highest-probability tokens are considered for text generation
- Top-P: Nucleus sampling threshold. Instead of a fixed number of tokens, selects tokens whose cumulative probability adds up to P. Lower values produce more focused text, higher values more diverse text.
- Min-P: Minimum probability threshold for a token to be considered. Tokens below this probability are excluded regardless of Top-K or Top-P. A simplified sampling sketch at the end of this page shows how Temperature, Top-K, Top-P, and Min-P interact.
- XTC Probability: (Exclude Top Choices) Probability of applying the XTC token selection strategy during generation.
- XTC Threshold: The probability threshold used by the XTC strategy to determine whether a token is included in the candidate set.
- Typical: Locally typical sampling (typical-P)
- Repeat Last N: Last N tokens to consider for the repeat penalty
- Context Size: Maximum number of tokens the LLM can process at once when generating a response.
- RoPE Frequency Base: Base frequency for RoPE. Affects how position information is encoded in the model. Default is 10000.0.
- RoPE Scaling: Scale factor for Rotary Position Embeddings (RoPE). Affects the model's ability to handle longer sequences beyond its training length.
- RoPE Frequency Scale: Frequency scaling factor for Rotary Position Embedding. Adjusts the position encoding scale. Used in conjunction with RoPE Frequency Base to fine-tune position embeddings.
- YaRN configuration: YaRN (Yet another RoPE extensioN method) is a compute-efficient method to extend the context window of llama models, requiring fewer tokens and training steps.
- YaRN Extrapolation Mix Factor: Controls context length extension. Default is 1.0. To disable YaRN, set to -1.0.
- YaRN Attention Factor: YaRN attention scaling factor. Affects attention computation in extended context. Higher values increase attention to distant tokens. Default is 1.0.
- YaRN Beta Fast: YaRN fast-path beta parameter. Controls the scaling behavior for nearby token relationships in extended context. Default is 32.0.
- YaRN Beta Slow: YaRN slow-path beta parameter. Controls the scaling behavior for distant token relationships in extended context. Default is 1.0.
- Threads: Number of CPU threads for parallel processing. Higher values can improve performance on multi-core systems but may increase CPU usage.
- Threads Batch: Number of threads for batch processing (separate from main processing threads). Optimizes handling of multiple requests.
- Tensor Split: How split tensors should be distributed across GPUs. Input a comma-separated list of proportions for splitting tensors across GPUs. For example, 3,2 will assign 60% of the data to GPU 0 and 40% to GPU 1
- Parallel: Number of slots for processing requests
- Max Concurrent Requests: The maximum number of concurrent requests for this particular deployment. A low limit causes excess client requests to be denied instead of waiting too long
- NUMA: Select the NUMA (non-uniform memory access) optimization mode:
- 'distributed': Splits across nodes
- 'isolate': Runs on a single node
- 'numactl': Uses system NUMA policy.
- Split Mode: Determines how the model is distributed across multiple GPUs. Select the model splitting mode:
- 'none': Uses a single GPU for the entire model
- 'layer': Splits the model by layers across multiple GPUs (default)
- 'row': Splits the model’s weight matrices row-wise
- Main GPU: Index of the primary GPU to use (e.g., '0'). When using multiple GPUs, this GPU will handle the main computation load.
- General
- Enable Debug Mode: Run deployment in debug mode
- Disable Logs
- Enable Automatic CPU Offloading: Enable multiple models to share GPUs by offloading idle models to CPU. If Max CUDA Memory exceeds GPU capacity, this application will offload the surplus to CPU RAM, virtually increasing the VRAM
- Enable Disk Swapping: Load multiple models on the same GPUs by offloading idle models to disk (requires Automatic CPU Offloading to be disabled)
- Hugging Face Token: Token for accessing Hugging Face models that require authentication
- Max CUDA Memory (GiB): The maximum amount of CUDA memory identified by the system. Can exceed the actual hardware memory. The surplus memory will be offloaded to the CPU memory. Only usable on amd64 machines.
- CUDA Memory Manager Minimum Threshold: Maximum size (KB) of the allocated chunks that should not be offloaded to CPU when using automatic CPU offloading. Defaults to -1 when running on a single GPU, and 66000 (64 MiB) when running on multiple GPUs
- Idle Options
- Idle Time Limit (Hours): Maximum idle time after which the app instance will shut down
- Last Action Report Interval (Seconds): The frequency at which the last activity made by the application is reported. Used to stop the application from entering an idle state when the machine metrics are low but the application is actually still running
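To make the sampling options above more concrete, here is a simplified sketch of how Temperature, Top-K, Top-P, and Min-P shape the next-token distribution. It illustrates the general technique only and is not llama.cpp's actual sampler implementation (which applies its samplers in a configurable order and with additional optimizations); the default values used below are arbitrary.

```python
# Simplified illustration of Temperature, Top-K, Top-P (nucleus), and Min-P filtering.
# Not llama.cpp's implementation -- just a toy model of what each knob does.
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.95, min_p=0.05, rng=None):
    rng = rng or np.random.default_rng()

    # Temperature: divide logits before softmax; <1.0 sharpens, >1.0 flattens the distribution.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-K: keep only the K most probable tokens.
    order = np.argsort(probs)[::-1]
    keep = order[:top_k]

    # Top-P (nucleus): keep the smallest prefix whose cumulative probability reaches P.
    cumulative = np.cumsum(probs[keep])
    keep = keep[: np.searchsorted(cumulative, top_p) + 1]

    # Min-P: drop tokens whose probability is far below the most likely remaining token.
    keep = keep[probs[keep] >= min_p * probs[keep].max()]

    # Renormalize and sample from the surviving candidates.
    renormalized = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=renormalized))

# Toy vocabulary of 5 tokens with raw scores (logits).
logits = np.array([2.0, 1.5, 0.3, -1.0, -2.5])
print(sample_next_token(logits))
```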