
Commit ac2aef2

fix(llamacpp): address PR #473 review comments
- Warn when a user gpu_layers override is applied after a non-SUCCESS llama_params_fit result: the fit call deliberately fell back to n_gpu_layers=0 to avoid an OOM, and silently reinstating the user's override reintroduces exactly that risk. We still honour the explicit override (existing behaviour), but a RAC_LOG_WARNING now surfaces the risk in the logs so it is visible when a subsequent load crashes. (greptile #3082185212)

- Clarify in a comment that the zero-initialized tensor_buft_overrides vector already contains a valid llama.cpp sentinel entry (pattern == nullptr), so an empty vector means "no overrides." (greptile #3082185321)

- Clamp ctx_params.n_ctx (uint32_t) to INT_MAX before the int cast used in std::min, so a pathological >2.1B context value can't wrap to negative and win std::min as the "smallest" context size. Added <climits> for INT_MAX. (greptile #3082185505)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parent: 1eb7542

1 file changed: sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp (22 additions, 1 deletion)
@@ -4,6 +4,7 @@
 #include <algorithm>
 #include <chrono>
+#include <climits>
 #include <cstring>
 #include <string>
 #include <vector>
@@ -262,6 +263,10 @@ bool LlamaCppTextGeneration::load_model(const std::string& model_path,
     size_t n_overrides = llama_max_tensor_buft_overrides();
 
     std::vector<float> tensor_split(n_devices, 0.0f);
+    // llama.cpp iterates tensor_buft_overrides until it hits a zero-valued
+    // sentinel entry (pattern == nullptr). Value-initializing the vector to
+    // all zeros means the first element is already that sentinel, so an
+    // empty vector is interpreted as "no tensor buft overrides."
     std::vector<llama_model_tensor_buft_override> tensor_buft_overrides(n_overrides);
     std::vector<size_t> margins(n_devices, 0);
 
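The sentinel convention the new comment documents is easy to show in isolation. A minimal sketch, using a stand-in struct rather than the real llama_model_tensor_buft_override (which pairs a tensor-name pattern with a backend buffer type); the loop is illustrative, not llama.cpp's actual code:

    #include <cstdio>
    #include <vector>

    struct override_entry {    // stand-in for llama_model_tensor_buft_override
        const char* pattern;   // tensor-name pattern; nullptr marks the sentinel
        void*       buft;      // stand-in for the backend buffer type handle
    };

    int main() {
        // Value-initialized: every element is zeroed, so element [0] is
        // already the (pattern == nullptr) sentinel.
        std::vector<override_entry> overrides(8);

        for (const auto& ov : overrides) {
            if (ov.pattern == nullptr) break;           // stop at the sentinel
            std::printf("override: %s\n", ov.pattern);  // never reached here
        }
    }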
@@ -342,6 +347,17 @@ bool LlamaCppTextGeneration::load_model(const std::string& model_path,
     }
 #else
     if (user_gpu_layers >= 0) {
+        // llama_params_fit fell back to n_gpu_layers=0 for non-SUCCESS outcomes;
+        // honouring the user override here reinstates the OOM risk the fit call
+        // was supposed to prevent. Log a warning so it's visible in the event of
+        // a subsequent crash/OOM, but keep honouring the user's explicit request.
+        if (fit_status != LLAMA_PARAMS_FIT_STATUS_SUCCESS) {
+            const char* fit_label =
+                fit_status == LLAMA_PARAMS_FIT_STATUS_FAILURE ? "FAILURE" : "ERROR";
+            RAC_LOG_WARNING("LLM.LlamaCpp",
+                            "Applying user gpu_layers=%d override despite llama_params_fit %s — risk of OOM",
+                            user_gpu_layers, fit_label);
+        }
         model_params.n_gpu_layers = user_gpu_layers;
         RAC_LOG_INFO("LLM.LlamaCpp", "Applying user GPU layers override: %d", user_gpu_layers);
     }
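Assuming RAC_LOG_WARNING takes printf-style formatting (as the %d/%s placeholders suggest) and a conventional level/tag prefix, a user override of gpu_layers=32 after a FAILURE fit would surface in the logs roughly as:

    [WARN] LLM.LlamaCpp: Applying user gpu_layers=32 override despite llama_params_fit FAILURE — risk of OOM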
@@ -364,7 +380,12 @@ bool LlamaCppTextGeneration::load_model(const std::string& model_path,
     if (ctx_params.n_ctx == 0) {
         ctx_params.n_ctx = std::min(model_train_ctx, max_default_context_);
     }
-    context_size_ = std::min({(int)ctx_params.n_ctx, model_train_ctx, max_default_context_});
+    // ctx_params.n_ctx is uint32_t; clamp to INT_MAX before converting to int so
+    // a pathological fitted/user value above ~2.1B can't wrap to a negative
+    // number that `std::min` would then pick as the "smallest" context size.
+    const int fitted_ctx = static_cast<int>(
+        std::min(ctx_params.n_ctx, static_cast<uint32_t>(INT_MAX)));
+    context_size_ = std::min({fitted_ctx, model_train_ctx, max_default_context_});
 
     RAC_LOG_INFO("LLM.LlamaCpp", "Final context size: %d (fitted=%u, train=%d, cap=%d)",
                  context_size_, ctx_params.n_ctx, model_train_ctx, max_default_context_);
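The wrap the clamp guards against is easy to reproduce standalone. A minimal sketch, with 4096 and 8192 standing in for model_train_ctx and max_default_context_, on a typical two's-complement target (the unclamped cast is implementation-defined before C++20):

    #include <algorithm>
    #include <climits>
    #include <cstdint>
    #include <cstdio>

    int main() {
        uint32_t n_ctx = 4000000000u;  // pathological value above INT_MAX (~2.1B)

        // Old behaviour: the bare cast wraps to a large negative number,
        // which std::min then picks as the "smallest" context size.
        int wrapped    = static_cast<int>(n_ctx);
        int old_result = std::min({wrapped, 4096, 8192});

        // New behaviour: clamp in unsigned space first, then convert.
        int fitted     = static_cast<int>(std::min(n_ctx, static_cast<uint32_t>(INT_MAX)));
        int new_result = std::min({fitted, 4096, 8192});

        std::printf("old=%d new=%d\n", old_result, new_result);
        // Prints: old=-294967296 new=4096
    }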
