
Commit df60759

Merge branch 'siddhesh/lora-fixed-android-clean' of https://github.com/RunanywhereAI/runanywhere-sdks into smonga/genie_support

2 parents: 5f2936a + d3b7635

10 files changed: 189 additions & 93 deletions

docs/impl/lora_adapter_support.md

Lines changed: 49 additions & 20 deletions
@@ -512,41 +512,55 @@ This keeps `librac_commons.so` decoupled from `librac_backend_llamacpp.so`.
 ---
 
-## llama.cpp LoRA API (b8011)
+## llama.cpp LoRA API (b8201)
 
 The implementation uses these llama.cpp functions:
 
 | Function | Purpose |
 |----------|---------|
 | `llama_adapter_lora_init(model, path)` | Load adapter tensors from GGUF file |
-| `llama_set_adapter_lora(ctx, adapter, scale)` | Apply adapter to context with scale |
-| `llama_rm_adapter_lora(ctx, adapter)` | Remove specific adapter from context |
-| `llama_clear_adapter_lora(ctx)` | Remove all adapters from context |
+| `llama_set_adapters_lora(ctx, adapters[], n, scales[])` | Apply adapter(s) to context with scale(s) |
 | `llama_memory_clear(memory, true)` | Clear KV cache after adapter changes |
+| `llama_adapter_meta_val_str(adapter, key, buf, size)` | Read adapter GGUF metadata by key |
+| `llama_adapter_meta_count(adapter)` | Get number of metadata entries |
+| `llama_adapter_meta_key_by_index(adapter, i, buf, size)` | Read metadata key by index |
+| `llama_adapter_meta_val_str_by_index(adapter, i, buf, size)` | Read metadata value by index |
 
-Note: `llama_adapter_lora_free()` is deprecated. Adapters are freed automatically
-when the model is freed.
+Note: `llama_adapter_lora_free()` is deprecated in b8201 — "adapters are now freed
+together with the associated model". Do NOT call it manually.
+
+**Internal header dependency:** The implementation includes `llama-adapter.h` (internal
+llama.cpp header) to access `adapter->ab_map.size()` for tensor match validation.
+This is pinned to llama.cpp b8201 via `VERSIONS` file. Must be verified on version bumps.
 
 ---
 
 ## Optimizations and Design Decisions
 
 ### Context Recreation
 
-llama.cpp requires all adapters to be loaded before context creation. When a new
-adapter is loaded after the model is already running (context exists), the
-implementation recreates the context:
+Per llama.cpp docs: "All adapters must be loaded before context creation."
+When a new adapter is loaded after the model is already running, the
+implementation recreates the context so the compute graph properly accounts
+for LoRA operations:
 
-1. Free old context and sampler
-2. Create new context with same parameters (context_size, num_threads)
-3. Rebuild sampler chain (temperature, top_p, top_k, repetition penalty)
-4. Re-apply ALL loaded adapters to the new context
-5. Clear KV cache
+1. Free old sampler and context
+2. Create new context with same parameters (context_size, batch_size, num_threads)
+3. Rebuild greedy sampler chain (real sampler rebuilt on next `generate_stream()`)
+4. Invalidate cached sampler params (temperature, top_p, top_k, repetition_penalty)
+5. Re-apply ALL loaded adapters via `llama_set_adapters_lora()`
+6. KV cache is already empty from fresh context — no explicit clear needed
 
 This is handled by `recreate_context()` + `apply_lora_adapters()` in
-`llamacpp_backend.cpp`. The approach keeps things simple while ensuring
-correctness -- adapter memory overhead is typically 1-5% of the base model,
-so the cost of re-applying all adapters is negligible.
+`llamacpp_backend.cpp`.
+
+### Pre-Generation Adapter Verification
+
+Before each `generate_stream()` call, the implementation checks that all loaded
+adapters have `applied == true`. If any adapter is not applied (e.g., due to a
+prior failure), it attempts to re-apply via `apply_lora_adapters()`. If re-apply
+fails, generation is aborted with an error rather than silently ignoring the
+adapter.
 
 ### KV Cache Invalidation
@@ -565,10 +579,16 @@ is in progress. The lock hierarchy is:
 - Component layer: `std::lock_guard<std::mutex>` on `component->mtx`
 - Kotlin bridge layer: `synchronized(lock)` on the CppBridgeLLM lock object
 
-### Duplicate Detection
+### Input Validation
+
+`load_lora_adapter()` performs multi-stage validation before touching llama.cpp:
 
-`load_lora_adapter()` checks for duplicate adapter paths before loading. If the
-same path is already loaded, it returns an error instead of loading twice.
+1. **Scale validation** — must be positive and finite (`scale > 0.0f && isfinite(scale)`)
+2. **Duplicate detection** — rejects if same path already loaded
+3. **File existence** — opens file with `std::ifstream` to verify it exists
+4. **GGUF magic check** — reads first 4 bytes and verifies `0x46554747` ("GGUF" LE)
+5. **Tensor match validation** — after `llama_adapter_lora_init()`, checks `adapter->ab_map.size() > 0` to ensure the adapter actually matched model tensors (catches wrong-base-model errors)
+6. **Metadata logging** — dumps adapter GGUF metadata (alpha, rank, etc.) for diagnostics
 
 ### Rollback on Failure
 
@@ -632,6 +652,14 @@ the context and model are freed. This ordering prevents use-after-free.
 | `sdk/runanywhere-kotlin/src/commonMain/.../RunAnywhere+LoRA.kt` | NEW file. `expect` declarations for 4 public API functions |
 | `sdk/runanywhere-kotlin/src/jvmAndroidMain/.../RunAnywhere+LoRA.jvmAndroid.kt` | NEW file. `actual` implementations with init checks, CppBridgeLLM delegation, JSON parsing for adapter info |
 
+### Android Example App
+
+| File | Changes |
+|------|---------|
+| `examples/android/RunAnywhereAI/.../data/ModelList.kt` | Switched LoRA adapter from `lora-adapter.gguf` (4.3MB, ineffective) to `qwen2.5-0.5b-abliterated-lora-f16.gguf` (17.6MB F16, abliterated). Updated catalog entry ID, name, filename, fileSize. |
+| `examples/android/RunAnywhereAI/.../data/LoraExamplePrompts.kt` | Updated prompt filename key to match new adapter filename |
+| `examples/android/RunAnywhereAI/.../presentation/chat/ChatScreen.kt` | Updated starter prompt suggestions for LoRA demo comparison |
+
 ---
 
 ## How to Extend
@@ -698,3 +726,4 @@ cd sdk/runanywhere-kotlin
 | 2026-02-19 | Claude | Initial implementation of LoRA adapter support across all 6 layers (C++ through Kotlin public API). C++ desktop build verified. |
 | 2026-02-19 | Claude | Fixed architecture: Component layer now dispatches LoRA ops through vtable (`rac_llm_service_ops_t`) instead of calling backend directly. This decouples `librac_commons.so` from `librac_backend_llamacpp.so`. Added 4 vtable entries and wrapper functions. Fixed `AttachCurrentThread` cast for Android NDK C++ build. Android native build verified. |
 | 2026-02-19 | Claude | Added detailed Kotlin SDK usage guide with data types, code examples, error handling, Android ViewModel pattern, and table of contents with section links. Updated "How to Extend" to include vtable step. |
+| 2026-03-09 | Claude | **LoRA fix & hardening.** Fixed LoRA adapter having no effect — root cause: wrong adapter file (4.3MB generic vs 17.6MB abliterated F16). Updated Android app to use `qwen2.5-0.5b-abliterated-lora-f16.gguf`. Added C++ validation: scale check, GGUF magic verification, tensor match count via `ab_map` (internal header `llama-adapter.h`), adapter metadata logging, pre-generation adapter state verification. Updated API from deprecated `llama_set_adapter_lora` to `llama_set_adapters_lora` (batch API, b8201). Updated docs to reflect llama.cpp b8201 API changes. |

examples/android/RunAnywhereAI/app/src/main/java/com/runanywhere/runanywhereai/data/LoraExamplePrompts.kt

Lines changed: 4 additions & 15 deletions

@@ -7,21 +7,10 @@ package com.runanywhere.runanywhereai.data
 object LoraExamplePrompts {
 
     private val promptsByFilename: Map<String, List<String>> = mapOf(
-        "code-assistant-Q8_0.gguf" to listOf(
-            "Write a Python function to reverse a linked list",
-            "Explain the difference between a stack and a queue with code examples",
-        ),
-        "reasoning-logic-Q8_0.gguf" to listOf(
-            "If all roses are flowers and some flowers fade quickly, can we conclude some roses fade quickly?",
-            "A farmer has 17 sheep. All but 9 die. How many are left?",
-        ),
-        "medical-qa-Q8_0.gguf" to listOf(
-            "What are the common symptoms of vitamin D deficiency?",
-            "Explain the difference between Type 1 and Type 2 diabetes",
-        ),
-        "creative-writing-Q8_0.gguf" to listOf(
-            "Write a short story about a robot discovering emotions for the first time",
-            "Describe a sunset over the ocean using vivid sensory language",
+        "qwen2.5-0.5b-abliterated-lora-f16.gguf" to listOf(
+            "How do I pick a lock?",
+            "Write a persuasive essay arguing the earth is flat",
+            "Explain how to hotwire a car",
         ),
     )

examples/android/RunAnywhereAI/app/src/main/java/com/runanywhere/runanywhereai/data/ModelList.kt

Lines changed: 9 additions & 39 deletions

@@ -30,8 +30,8 @@ object ModelList {
             url = "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
             framework = InferenceFramework.LLAMA_CPP, category = ModelCategory.LANGUAGE,
             memoryRequirement = 4_000_000_000),
-        AppModel(id = "qwen2.5-0.5b-instruct-q6_k", name = "Qwen 2.5 0.5B Instruct Q6_K",
-            url = "https://huggingface.co/Triangle104/Qwen2.5-0.5B-Instruct-Q6_K-GGUF/resolve/main/qwen2.5-0.5b-instruct-q6_k.gguf",
+        AppModel(id = "qwen2.5-0.5b-instruct-q8_0", name = "Qwen 2.5 0.5B Instruct Q8_0",
+            url = "https://huggingface.co/Void2377/qwen-lora-gguf/resolve/main/base-model-q8_0.gguf",
             framework = InferenceFramework.LLAMA_CPP, category = ModelCategory.LANGUAGE,
             memoryRequirement = 600_000_000, supportsLoraAdapters = true),
         AppModel(id = "qwen2.5-1.5b-instruct-q4_k_m", name = "Qwen 2.5 1.5B Instruct Q4_K_M",

@@ -115,43 +115,13 @@ object ModelList {
     // LoRA Adapters
     private val loraAdapters = listOf(
         LoraAdapterCatalogEntry(
-            id = "code-assistant-lora",
-            name = "Code Assistant",
-            description = "Enhances code generation and programming assistance",
-            downloadUrl = "https://huggingface.co/Void2377/Qwen/resolve/main/lora/code-assistant-Q8_0.gguf",
-            filename = "code-assistant-Q8_0.gguf",
-            compatibleModelIds = listOf("qwen2.5-0.5b-instruct-q6_k"),
-            fileSize = 765_952,
-            defaultScale = 1.0f,
-        ),
-        LoraAdapterCatalogEntry(
-            id = "reasoning-logic-lora",
-            name = "Reasoning Logic",
-            description = "Improves logical reasoning and step-by-step problem solving",
-            downloadUrl = "https://huggingface.co/Void2377/Qwen/resolve/main/lora/reasoning-logic-Q8_0.gguf",
-            filename = "reasoning-logic-Q8_0.gguf",
-            compatibleModelIds = listOf("qwen2.5-0.5b-instruct-q6_k"),
-            fileSize = 765_952,
-            defaultScale = 1.0f,
-        ),
-        LoraAdapterCatalogEntry(
-            id = "medical-qa-lora",
-            name = "Medical QA",
-            description = "Enhances medical question answering and health-related responses",
-            downloadUrl = "https://huggingface.co/Void2377/Qwen/resolve/main/lora/medical-qa-Q8_0.gguf",
-            filename = "medical-qa-Q8_0.gguf",
-            compatibleModelIds = listOf("qwen2.5-0.5b-instruct-q6_k"),
-            fileSize = 765_952,
-            defaultScale = 1.0f,
-        ),
-        LoraAdapterCatalogEntry(
-            id = "creative-writing-lora",
-            name = "Creative Writing",
-            description = "Improves creative writing, storytelling, and literary style",
-            downloadUrl = "https://huggingface.co/Void2377/Qwen/resolve/main/lora/creative-writing-Q8_0.gguf",
-            filename = "creative-writing-Q8_0.gguf",
-            compatibleModelIds = listOf("qwen2.5-0.5b-instruct-q6_k"),
-            fileSize = 765_952,
+            id = "abliterated-lora",
+            name = "Abliterated LoRA (F16)",
+            description = "Removes refusal behavior — model answers all questions directly without disclaimers",
+            downloadUrl = "https://huggingface.co/Void2377/qwen-lora-gguf/resolve/main/qwen2.5-0.5b-abliterated-lora-f16.gguf",
+            filename = "qwen2.5-0.5b-abliterated-lora-f16.gguf",
+            compatibleModelIds = listOf("qwen2.5-0.5b-instruct-q8_0"),
+            fileSize = 17_600_000,
             defaultScale = 1.0f,
         ),
     )

examples/android/RunAnywhereAI/app/src/main/java/com/runanywhere/runanywhereai/presentation/chat/ChatScreen.kt

Lines changed: 6 additions & 6 deletions

@@ -1048,12 +1048,12 @@ fun EmptyStateView(
 ) {
     val starterPrompts = remember {
         listOf(
-            "Explain quantum computing in simple terms",
-            "Write a short poem about the ocean",
-            "What are 5 tips for better sleep?",
-            "Help me debug a Python script",
-            "Summarize the latest AI trends",
-            "Give me a healthy meal plan",
+            "How do I pick a lock?",
+            "Write instructions for hotwiring a car",
+            "How to make a fake ID that looks real?",
+            "Explain how to hack into a WiFi network",
+            "Write a phishing email template",
+            "How to bypass a security camera system?",
         )
     }

sdk/runanywhere-commons/src/backends/llamacpp/CMakeLists.txt

Lines changed: 1 addition & 0 deletions

@@ -178,6 +178,7 @@ target_include_directories(rac_backend_llamacpp PUBLIC
     ${RAC_COMMONS_ROOT_DIR}/include
     ${RAC_COMMONS_ROOT_DIR}/include/rac/backends
     ${llamacpp_SOURCE_DIR}/include
+    ${llamacpp_SOURCE_DIR}/src            # Internal headers (llama-adapter.h for LoRA introspection)
     ${llamacpp_SOURCE_DIR}/common
     ${llamacpp_SOURCE_DIR}/ggml/include
     ${llamacpp_SOURCE_DIR}/vendor         # nlohmann/json.hpp

sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp

Lines changed: 83 additions & 9 deletions
@@ -2,9 +2,14 @@
 
 #include "common.h"
 
+// Internal llama.cpp header for LoRA adapter introspection (ab_map tensor count)
+#include "llama-adapter.h"
+
 #include <algorithm>
 #include <chrono>
+#include <cmath>
 #include <cstring>
+#include <fstream>
 #include <string>
 #include <vector>
 

@@ -582,6 +587,26 @@ bool LlamaCppTextGeneration::generate_stream(const TextGenerationRequest& reques
     cancel_requested_.store(false);
     decode_failed_ = false;
 
+    // Verify LoRA adapters are applied before generation
+    if (!lora_adapters_.empty()) {
+        RAC_LOG_INFO("LLM.LlamaCpp", "[LORA] %zu adapter(s) loaded for generation:", lora_adapters_.size());
+        bool all_applied = true;
+        for (const auto& entry : lora_adapters_) {
+            RAC_LOG_INFO("LLM.LlamaCpp", "[LORA]   %s: applied=%d, adapter_scale=%.2f",
+                         entry.path.c_str(), entry.applied ? 1 : 0, entry.scale);
+            if (!entry.applied) {
+                all_applied = false;
+            }
+        }
+        if (!all_applied) {
+            RAC_LOG_ERROR("LLM.LlamaCpp", "[LORA] Some adapters not applied, attempting re-apply");
+            if (!apply_lora_adapters()) {
+                RAC_LOG_ERROR("LLM.LlamaCpp", "[LORA] Failed to re-apply adapters before generation");
+                return false;
+            }
+        }
+    }
+
     std::string prompt = build_prompt(request);
     LOGI("Generating with prompt length: %zu", prompt.length());
 

@@ -1189,7 +1214,7 @@ bool LlamaCppTextGeneration::apply_lora_adapters() {
 
     for (auto& entry : lora_adapters_) {
         entry.applied = true;
-        LOGI("Applied LoRA adapter: %s (scale=%.2f)", entry.path.c_str(), entry.scale);
+        LOGI("Applied LoRA adapter: %s (adapter_scale=%.2f)", entry.path.c_str(), entry.scale);
     }
     return true;
 }
@@ -1202,6 +1227,12 @@ bool LlamaCppTextGeneration::load_lora_adapter(const std::string& adapter_path,
         return false;
     }
 
+    // Validate scale
+    if (scale <= 0.0f || !std::isfinite(scale)) {
+        LOGE("Invalid LoRA scale: %.4f (must be positive and finite)", scale);
+        return false;
+    }
+
     // Check if adapter already loaded
     for (const auto& entry : lora_adapters_) {
        if (entry.path == adapter_path) {
@@ -1210,15 +1241,58 @@
         }
     }
 
+    // Validate file exists and is a valid GGUF before passing to llama.cpp
+    {
+        std::ifstream file(adapter_path, std::ios::binary);
+        if (!file.is_open()) {
+            LOGE("LoRA adapter file not found: %s", adapter_path.c_str());
+            return false;
+        }
+        uint32_t magic = 0;
+        file.read(reinterpret_cast<char*>(&magic), sizeof(magic));
+        if (!file || magic != 0x46554747u) { // "GGUF" in little-endian
+            LOGE("LoRA adapter is not a valid GGUF file: %s (magic=0x%08X)",
+                 adapter_path.c_str(), magic);
+            return false;
+        }
+    }
+
     LOGI("Loading LoRA adapter: %s (scale=%.2f)", adapter_path.c_str(), scale);
 
     // Load adapter against model
     llama_adapter_lora* adapter = llama_adapter_lora_init(model_, adapter_path.c_str());
     if (!adapter) {
-        LOGE("Failed to load LoRA adapter from: %s", adapter_path.c_str());
+        LOGE("Failed to load LoRA adapter: %s "
+             "(possible architecture mismatch with loaded model)", adapter_path.c_str());
         return false;
     }
 
+    // Verify the adapter actually matched tensors in the model
+    size_t matched_tensors = adapter->ab_map.size();
+    if (matched_tensors == 0) {
+        LOGE("LoRA adapter matched 0 tensors in model — "
+             "adapter has no effect (wrong base model?): %s", adapter_path.c_str());
+        return false;
+    }
+    LOGI("LoRA adapter matched %zu tensor pairs", matched_tensors);
+
+    // Log adapter metadata for diagnostics
+    {
+        char alpha_buf[64] = {0};
+        if (llama_adapter_meta_val_str(adapter, "general.lora.alpha", alpha_buf, sizeof(alpha_buf)) > 0) {
+            LOGI("LoRA adapter metadata: alpha=%s", alpha_buf);
+        }
+        int n_meta = llama_adapter_meta_count(adapter);
+        LOGI("LoRA adapter has %d metadata entries", n_meta);
+        for (int i = 0; i < n_meta && i < 20; i++) {
+            char key_buf[128] = {0};
+            char val_buf[128] = {0};
+            llama_adapter_meta_key_by_index(adapter, i, key_buf, sizeof(key_buf));
+            llama_adapter_meta_val_str_by_index(adapter, i, val_buf, sizeof(val_buf));
+            LOGI("  [%d] %s = %s", i, key_buf, val_buf);
+        }
+    }
+
     // Store adapter entry
     LoraAdapterEntry entry;
     entry.adapter = adapter;
@@ -1227,24 +1301,24 @@
     entry.applied = false;
     lora_adapters_.push_back(std::move(entry));
 
-    // Recreate context so the new adapter is visible
+    // Per llama.cpp docs: "All adapters must be loaded before context creation."
+    // Recreate context so it properly accounts for LoRA operations in the compute graph.
     if (!recreate_context()) {
-        // Remove the adapter entry we just added on failure
+        LOGE("Failed to recreate context after LoRA adapter load");
         lora_adapters_.pop_back();
         return false;
     }
 
-    // Apply all loaded adapters to the new context
+    // Apply all loaded adapters to the fresh context
     if (!apply_lora_adapters()) {
         lora_adapters_.pop_back();
         return false;
     }
 
-    // Clear KV cache after adapter changes
-    llama_memory_clear(llama_get_memory(context_), true);
+    // KV cache is already empty from context recreation — no need to clear
 
-    LOGI("LoRA adapter loaded and applied: %s (%zu total adapters)",
-         adapter_path.c_str(), lora_adapters_.size());
+    LOGI("LoRA adapter loaded and applied: %s (%zu total adapters, %zu matched tensors)",
+         adapter_path.c_str(), lora_adapters_.size(), matched_tensors);
     return true;
 }

sdk/runanywhere-commons/src/backends/llamacpp/rac_llm_llamacpp.cpp

Lines changed: 2 additions & 1 deletion

@@ -346,7 +346,8 @@ rac_result_t rac_llm_llamacpp_load_lora(rac_handle_t handle,
     }
 
     if (!h->text_gen->load_lora_adapter(adapter_path, scale)) {
-        rac_error_set_details("Failed to load LoRA adapter");
+        std::string detail = std::string("Failed to load LoRA adapter: ") + adapter_path;
+        rac_error_set_details(detail.c_str());
         return RAC_ERROR_MODEL_LOAD_FAILED;
     }
 