
Commit 7be8565

Revert "Merge PR #446: Add Genie NPU backend support (Qualcomm Snapdragon)"
This reverts commit 39bc265, reversing changes made to 59fa472.
1 parent 39bc265 · commit 7be8565

128 files changed

Lines changed: 730 additions & 4184 deletions


.gitignore

Lines changed: 0 additions & 6 deletions
```diff
@@ -388,12 +388,6 @@ tools/
 sdk/runanywhere-react-native/packages/rag/ios/.testlocal
 
 
-# Python virtual environments
-.venv*/
-venv*/
-__pycache__/
-*.pyc
-
 # Node
 node_modules/
 /tools/
```

.idea/vcs.xml

Lines changed: 1 addition & 13 deletions
Some generated files are not rendered by default.

Binary file not shown (-219 KB).

docs/impl/lora_adapter_support.md

Lines changed: 20 additions & 49 deletions
```diff
@@ -512,24 +512,18 @@ This keeps `librac_commons.so` decoupled from `librac_backend_llamacpp.so`.
 
 ---
 
-## llama.cpp LoRA API (b8201)
+## llama.cpp LoRA API (b8011)
 
 The implementation uses these llama.cpp functions:
 
 | Function | Purpose |
 |----------|---------|
 | `llama_adapter_lora_init(model, path)` | Load adapter tensors from GGUF file |
-| `llama_set_adapters_lora(ctx, adapters[], n, scales[])` | Apply adapter(s) to context with scale(s) |
+| `llama_set_adapter_lora(ctx, adapter, scale)` | Apply adapter to context with scale |
+| `llama_rm_adapter_lora(ctx, adapter)` | Remove specific adapter from context |
+| `llama_clear_adapter_lora(ctx)` | Remove all adapters from context |
 | `llama_memory_clear(memory, true)` | Clear KV cache after adapter changes |
-| `llama_adapter_meta_val_str(adapter, key, buf, size)` | Read adapter GGUF metadata by key |
-| `llama_adapter_meta_count(adapter)` | Get number of metadata entries |
-| `llama_adapter_meta_key_by_index(adapter, i, buf, size)` | Read metadata key by index |
-| `llama_adapter_meta_val_str_by_index(adapter, i, buf, size)` | Read metadata value by index |
 
-Note: `llama_adapter_lora_free()` is deprecated in b8201 — "adapters are now freed
-together with the associated model". Do NOT call it manually.
-
-**Internal header dependency:** The implementation includes `llama-adapter.h` (internal
-llama.cpp header) to access `adapter->ab_map.size()` for tensor match validation.
-This is pinned to llama.cpp b8201 via `VERSIONS` file. Must be verified on version bumps.
+Note: `llama_adapter_lora_free()` is deprecated. Adapters are freed automatically
+when the model is freed.
 
```
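The single-adapter calls restored by this revert (`llama_set_adapter_lora`, `llama_rm_adapter_lora`, `llama_clear_adapter_lora`) can be sketched with self-contained stand-ins. Everything prefixed `sim_` below is hypothetical; the real code operates on opaque `llama_context`/`llama_adapter_lora` handles from `llama.h`:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in types; the real (opaque) structs come from llama.h.
struct sim_adapter { std::string path; };

struct sim_context {
    // (adapter, scale) pairs currently applied to this context.
    std::vector<std::pair<const sim_adapter*, float>> applied;
};

// Mirrors llama_set_adapter_lora(ctx, adapter, scale): one call per adapter.
int sim_set_adapter_lora(sim_context& ctx, const sim_adapter* a, float scale) {
    ctx.applied.emplace_back(a, scale);
    return 0; // 0 on success, following llama.cpp's convention
}

// Mirrors llama_rm_adapter_lora(ctx, adapter): detach a single adapter.
void sim_rm_adapter_lora(sim_context& ctx, const sim_adapter* a) {
    ctx.applied.erase(
        std::remove_if(ctx.applied.begin(), ctx.applied.end(),
                       [a](const auto& p) { return p.first == a; }),
        ctx.applied.end());
}

// Mirrors llama_clear_adapter_lora(ctx): detach every adapter at once.
void sim_clear_adapter_lora(sim_context& ctx) { ctx.applied.clear(); }
```

The point of the sketch is the per-adapter shape of the restored API: applying N adapters means N `set` calls, one scale each, rather than one batch call with parallel arrays.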
```diff
@@ -536,31 +530,23 @@ when the model is freed.
 ---
 
 ## Optimizations and Design Decisions
 
 ### Context Recreation
 
-Per llama.cpp docs: "All adapters must be loaded before context creation."
-When a new adapter is loaded after the model is already running, the
-implementation recreates the context so the compute graph properly accounts
-for LoRA operations:
+llama.cpp requires all adapters to be loaded before context creation. When a new
+adapter is loaded after the model is already running (context exists), the
+implementation recreates the context:
 
-1. Free old sampler and context
-2. Create new context with same parameters (context_size, batch_size, num_threads)
-3. Rebuild greedy sampler chain (real sampler rebuilt on next `generate_stream()`)
-4. Invalidate cached sampler params (temperature, top_p, top_k, repetition_penalty)
-5. Re-apply ALL loaded adapters via `llama_set_adapters_lora()`
-6. KV cache is already empty from fresh context — no explicit clear needed
+1. Free old context and sampler
+2. Create new context with same parameters (context_size, num_threads)
+3. Rebuild sampler chain (temperature, top_p, top_k, repetition penalty)
+4. Re-apply ALL loaded adapters to the new context
+5. Clear KV cache
 
 This is handled by `recreate_context()` + `apply_lora_adapters()` in
-`llamacpp_backend.cpp`.
-
-### Pre-Generation Adapter Verification
-
-Before each `generate_stream()` call, the implementation checks that all loaded
-adapters have `applied == true`. If any adapter is not applied (e.g., due to a
-prior failure), it attempts to re-apply via `apply_lora_adapters()`. If re-apply
-fails, generation is aborted with an error rather than silently ignoring the
-adapter.
+`llamacpp_backend.cpp`. The approach keeps things simple while ensuring
+correctness -- adapter memory overhead is typically 1-5% of the base model,
+so the cost of re-applying all adapters is negligible.
 
 ### KV Cache Invalidation
 
```
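The recreation ordering described in this section can be modeled as a small state sketch. All types and names here are hypothetical stand-ins for the backend's real `llama_context*`/`llama_sampler*` handles; only the ordering is taken from the document:

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins; the real backend holds llama_context*/llama_sampler*.
struct sim_params  { int context_size = 4096; int num_threads = 4; };
struct sim_sampler { /* sampler chain state */ };
struct sim_context {
    sim_params params;
    int kv_tokens = 0;                   // tokens currently in the KV cache
    std::vector<std::string> adapters;   // adapters applied to this context
};

struct sim_backend {
    sim_params params;
    std::vector<std::string> loaded_adapters; // every adapter loaded so far
    std::unique_ptr<sim_context> ctx;
    std::unique_ptr<sim_sampler> sampler;

    // Sketch of recreate_context() + apply_lora_adapters():
    void recreate_context() {
        sampler.reset();                           // 1. free old context and sampler
        ctx.reset();
        ctx = std::make_unique<sim_context>();     // 2. same parameters as before
        ctx->params = params;
        sampler = std::make_unique<sim_sampler>(); // 3. rebuild sampler chain
        ctx->adapters = loaded_adapters;           // 4. re-apply ALL loaded adapters
        ctx->kv_tokens = 0;                        // 5. KV cache starts empty
    }
};
```

The design choice worth noting: because the adapter list lives on the backend, not the context, recreating the context cannot silently drop an adapter — re-applying all of them is idempotent and cheap relative to the base model.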
```diff
@@ -579,16 +565,10 @@ is in progress. The lock hierarchy is:
 - Component layer: `std::lock_guard<std::mutex>` on `component->mtx`
 - Kotlin bridge layer: `synchronized(lock)` on the CppBridgeLLM lock object
 
-### Input Validation
-
-`load_lora_adapter()` performs multi-stage validation before touching llama.cpp:
+### Duplicate Detection
 
-1. **Scale validation** — must be positive and finite (`scale > 0.0f && isfinite(scale)`)
-2. **Duplicate detection** — rejects if same path already loaded
-3. **File existence** — opens file with `std::ifstream` to verify it exists
-4. **GGUF magic check** — reads first 4 bytes and verifies `0x46554747` ("GGUF" LE)
-5. **Tensor match validation** — after `llama_adapter_lora_init()`, checks `adapter->ab_map.size() > 0` to ensure the adapter actually matched model tensors (catches wrong-base-model errors)
-6. **Metadata logging** — dumps adapter GGUF metadata (alpha, rank, etc.) for diagnostics
+`load_lora_adapter()` checks for duplicate adapter paths before loading. If the
+same path is already loaded, it returns an error instead of loading twice.
 
 ### Rollback on Failure
 
```
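Both checks in this hunk are easy to sketch self-contained: the duplicate-path check (which survives the revert) and the GGUF magic check (part of the multi-stage validation this commit removes, included here only as illustration). The helper names are hypothetical:

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct loaded_adapter { std::string path; float scale; };

// Hypothetical helper: reject a path that is already in the loaded list.
bool is_duplicate_path(const std::vector<loaded_adapter>& loaded,
                       const std::string& path) {
    return std::any_of(loaded.begin(), loaded.end(),
                       [&](const loaded_adapter& a) { return a.path == path; });
}

// From the reverted validation: read the first 4 bytes and compare to "GGUF"
// (which reads as 0x46554747 when interpreted as a little-endian uint32).
bool has_gguf_magic(const std::string& path) {
    std::ifstream f(path, std::ios::binary);
    char magic[4] = {0};
    if (!f.read(magic, 4)) return false; // missing or too-short file fails too
    return magic[0] == 'G' && magic[1] == 'G' &&
           magic[2] == 'U' && magic[3] == 'F';
}
```

Note that the byte-wise comparison sidesteps endianness entirely, and a failed 4-byte read doubles as the file-existence check.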
```diff
@@ -652,14 +632,6 @@ the context and model are freed. This ordering prevents use-after-free.
 | `sdk/runanywhere-kotlin/src/commonMain/.../RunAnywhere+LoRA.kt` | NEW file. `expect` declarations for 4 public API functions |
 | `sdk/runanywhere-kotlin/src/jvmAndroidMain/.../RunAnywhere+LoRA.jvmAndroid.kt` | NEW file. `actual` implementations with init checks, CppBridgeLLM delegation, JSON parsing for adapter info |
 
-### Android Example App
-
-| File | Changes |
-|------|---------|
-| `examples/android/RunAnywhereAI/.../data/ModelList.kt` | Switched LoRA adapter from `lora-adapter.gguf` (4.3MB, ineffective) to `qwen2.5-0.5b-abliterated-lora-f16.gguf` (17.6MB F16, abliterated). Updated catalog entry ID, name, filename, fileSize. |
-| `examples/android/RunAnywhereAI/.../data/LoraExamplePrompts.kt` | Updated prompt filename key to match new adapter filename |
-| `examples/android/RunAnywhereAI/.../presentation/chat/ChatScreen.kt` | Updated starter prompt suggestions for LoRA demo comparison |
-
 ---
 
 ## How to Extend
```
```diff
@@ -726,4 +698,3 @@ cd sdk/runanywhere-kotlin
 | 2026-02-19 | Claude | Initial implementation of LoRA adapter support across all 6 layers (C++ through Kotlin public API). C++ desktop build verified. |
 | 2026-02-19 | Claude | Fixed architecture: Component layer now dispatches LoRA ops through vtable (`rac_llm_service_ops_t`) instead of calling backend directly. This decouples `librac_commons.so` from `librac_backend_llamacpp.so`. Added 4 vtable entries and wrapper functions. Fixed `AttachCurrentThread` cast for Android NDK C++ build. Android native build verified. |
 | 2026-02-19 | Claude | Added detailed Kotlin SDK usage guide with data types, code examples, error handling, Android ViewModel pattern, and table of contents with section links. Updated "How to Extend" to include vtable step. |
-| 2026-03-09 | Claude | **LoRA fix & hardening.** Fixed LoRA adapter having no effect — root cause: wrong adapter file (4.3MB generic vs 17.6MB abliterated F16). Updated Android app to use `qwen2.5-0.5b-abliterated-lora-f16.gguf`. Added C++ validation: scale check, GGUF magic verification, tensor match count via `ab_map` (internal header `llama-adapter.h`), adapter metadata logging, pre-generation adapter state verification. Updated API from deprecated `llama_set_adapter_lora` to `llama_set_adapters_lora` (batch API, b8201). Updated docs to reflect llama.cpp b8201 API changes. |
```
