<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>RTX 4070 Ti | Alfonso Fortunato</title><link>https://alfonsofortunato.com/tags/rtx-4070-ti/</link><atom:link href="https://alfonsofortunato.com/tags/rtx-4070-ti/index.xml" rel="self" type="application/rss+xml"/><description>RTX 4070 Ti</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 04 Apr 2026 00:00:00 +0000</lastBuildDate><image><url>https://alfonsofortunato.com/media/logo_hu_2de8902a9a04b271.png</url><title>RTX 4070 Ti</title><link>https://alfonsofortunato.com/tags/rtx-4070-ti/</link></image><item><title>Gemma 4 E4B vs 26B on an RTX 4070 Ti: Benchmarks, RAG, and a Real Webapp Test</title><link>https://alfonsofortunato.com/blog/gemma-4-e4b-vs-26b-local-benchmarks/</link><pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate><guid>https://alfonsofortunato.com/blog/gemma-4-e4b-vs-26b-local-benchmarks/</guid><description>&lt;p&gt;I have been spending more and more time testing local AI lately, but not for privacy reasons. I do not really believe in that argument anymore. The real reason is cost. What I actually want is a cheaper model for lightweight local RAG, web retrieval, and grounded summaries, with enough reasoning and tool use to be useful without reaching for a larger hosted model every time. Simple app generation is still interesting, but it is not the main target. That is why local benchmarks matter more to me than vague claims like &amp;ldquo;this one feels faster&amp;rdquo; or &amp;ldquo;that one is better for local use.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;So I picked two Gemma 4 variants from Unsloth and ran them on my RTX 4070 Ti with &lt;code&gt;llama.cpp&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gemma-4-E4B-it-GGUF&lt;/code&gt; in &lt;code&gt;Q8_0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gemma-4-26B-A4B-it-GGUF&lt;/code&gt; in &lt;code&gt;UD-Q4_K_XL&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At first I expected the 26B model to be the obvious winner. Bigger context window, bigger vision stack, more total parameters, more of that &amp;ldquo;serious model&amp;rdquo; aura that makes you assume quality must follow. Then I actually benchmarked them. Different story.&lt;/p&gt;
&lt;h2 id="the-setup"&gt;The setup&lt;/h2&gt;
&lt;p&gt;Nothing exotic here. Just a local machine, &lt;code&gt;llama.cpp&lt;/code&gt;, and a 12 GB RTX 4070 Ti.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CPU: AMD Ryzen 9 7950X3D&lt;/li&gt;
&lt;li&gt;RAM available: 16 GB&lt;/li&gt;
&lt;li&gt;GPU: NVIDIA GeForce RTX 4070 Ti&lt;/li&gt;
&lt;li&gt;Threads used for the benchmark: 16&lt;/li&gt;
&lt;li&gt;Build: &lt;code&gt;llama.cpp&lt;/code&gt; &lt;code&gt;5208e2d5b (8641)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Prompt benchmarks tested: &lt;code&gt;pp512&lt;/code&gt;, &lt;code&gt;pp4096&lt;/code&gt;, &lt;code&gt;pp8192&lt;/code&gt;, &lt;code&gt;pp16384&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Generation benchmarks tested: &lt;code&gt;tg128&lt;/code&gt;, &lt;code&gt;tg256&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The goal was simple: understand what these models feel like on real local hardware, not on a product page, not in a benchmark chart detached from context, and not on a machine I don&amp;rsquo;t own.&lt;/p&gt;
&lt;h2 id="the-local-flow"&gt;The local flow&lt;/h2&gt;
&lt;p&gt;This is the shape of the setup I ended up with:&lt;/p&gt;
&lt;div class="mermaid"&gt;flowchart LR
A[RTX 4070 Ti workstation] --&gt; B[llama.cpp]
B --&gt; C[llama-bench]
B --&gt; D[llama-cli]
B --&gt; E[llama-server on :8081]
E --&gt; F[OpenAI-compatible endpoint /v1]
E --&gt; G[Anthropic-compatible endpoint /v1/messages]
F --&gt; H[opencode]
G --&gt; I[claude-code]
&lt;/div&gt;
&lt;h2 id="context-length-is-where-things-get-real"&gt;Context length is where things get real&lt;/h2&gt;
&lt;p&gt;The model card numbers are seductive.&lt;/p&gt;
&lt;p&gt;Gemma 4 E4B advertises &lt;code&gt;128K&lt;/code&gt; context. Gemma 4 26B A4B goes to &lt;code&gt;256K&lt;/code&gt;. On paper that sounds like you should just dial the context up and enjoy the extra room. On a 12 GB card, that is not how this plays out.&lt;/p&gt;
&lt;p&gt;In practice, context is not free. The KV cache grows with it, and that cost shows up fast when you are already pushing the card with a model that you want fully offloaded. The model weights are only part of the memory story. The cache becomes the other half of the budget.&lt;/p&gt;
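&lt;p&gt;To make that growth concrete, here is a rough sizing sketch. The head counts and the global-vs-sliding layer split below are illustrative assumptions, not the published Gemma 4 architecture numbers; the one grounded detail is that sliding-window layers (512 tokens for E4B, per the spec table below) only cache their window, which is why the real footprint lands far below a naive full-attention estimate.&lt;/p&gt;

```python
# Rough KV cache sizing sketch. All architecture numbers here are
# illustrative assumptions, not the real Gemma 4 E4B configuration.
def kv_bytes(n_layers, n_kv_heads, head_dim, ctx, dtype_bytes=2):
    # 2x for keys and values; f16 cache entries by default
    return 2 * n_layers * n_kv_heads * head_dim * ctx * dtype_bytes

GIB = 1024 ** 3
ctx = 65536

# Assumed split: a few full-attention layers see the whole context,
# while sliding-window layers only cache their 512-token window.
full_attn = kv_bytes(n_layers=7, n_kv_heads=8, head_dim=128, ctx=ctx)
sliding = kv_bytes(n_layers=35, n_kv_heads=8, head_dim=128, ctx=512)

print(f"~{(full_attn + sliding) / GIB:.2f} GiB KV cache at {ctx} context")
```

&lt;p&gt;Even with those generous assumptions, the cache lands around a couple of GiB at &lt;code&gt;64K&lt;/code&gt;, on top of the model weights, which is exactly why it becomes the other half of the budget on a 12 GB card.&lt;/p&gt;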
&lt;p&gt;That is why the practical baseline commands in this post use:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;-c &lt;span class="m"&gt;65536&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Not because Gemma 4 is limited to 64K (it clearly is not), but because &lt;code&gt;65536&lt;/code&gt; is the lowest setting that feels meaningfully &amp;ldquo;long context&amp;rdquo; on this setup while still being worth testing locally. It is large enough to expose the real tradeoffs without pretending the advertised maximum and the practical maximum are the same thing.&lt;/p&gt;
&lt;p&gt;Later in the post I use that same &lt;code&gt;64K&lt;/code&gt; setting for the real web app generation test, because at this point that is the minimum context window I actually care about for local experimentation.&lt;/p&gt;
&lt;h2 id="kv-cache-considerations-on-a-4070-ti"&gt;KV cache considerations on a 4070 Ti&lt;/h2&gt;
&lt;p&gt;This is the part worth saying plainly.&lt;/p&gt;
&lt;p&gt;If you push context too far, you usually stop talking about model quality and start talking about memory pressure, slower startup, reduced headroom, and awkward tradeoffs between context size and GPU offload. None of that is glamorous, but all of it matters when you are trying to make a local setup pleasant enough to use every day.&lt;/p&gt;
&lt;p&gt;The safest mental model is this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;model weights decide whether the model fits at all&lt;/li&gt;
&lt;li&gt;KV cache decides how far you can push context before the setup becomes annoying&lt;/li&gt;
&lt;li&gt;full offload and large context are often competing goals on mid-range hardware&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For my 4070 Ti setup, the sweet spot is not &amp;ldquo;maximum context.&amp;rdquo; It is &amp;ldquo;enough context to be useful without wrecking latency or starving VRAM.&amp;rdquo; If I were setting up a real local workflow instead of a synthetic benchmark, &lt;code&gt;64K&lt;/code&gt; is the minimum where I would start testing E4B seriously. That is where the long-context promise starts to feel real. It is also exactly where you need to watch KV cache behavior instead of assuming the card will stay comfortable.&lt;/p&gt;
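&lt;p&gt;As a sanity check, the budget arithmetic for full E4B offload on this card looks roughly like this. The weights figure comes from the benchmark table below; the KV and overhead figures are assumptions for illustration, not measurements:&lt;/p&gt;

```python
# Back-of-the-envelope VRAM budget for full E4B offload on a 12 GB card.
# KV and overhead figures are illustrative assumptions, not measurements.
vram_gib = 12.0
weights_gib = 7.62    # E4B Q8_0 size, from the benchmark table
kv_gib = 2.0          # assumed KV cache at -c 65536
overhead_gib = 1.0    # assumed compute buffers and CUDA context

headroom = vram_gib - (weights_gib + kv_gib + overhead_gib)
print(f"headroom: {headroom:.2f} GiB")
```

&lt;p&gt;When that headroom goes negative, you either spill layers off the GPU or fail to allocate, and that is exactly the point where &amp;ldquo;maximum context&amp;rdquo; stops being a free choice.&lt;/p&gt;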
&lt;p&gt;The 26B model makes this even more obvious. It already asks much more from the machine, so large-context experimentation becomes even less forgiving. The longer window is real, but whether it is practical on your local box is a separate question.&lt;/p&gt;
&lt;h2 id="what-these-two-models-actually-are"&gt;What these two models actually are&lt;/h2&gt;
&lt;p&gt;Before the numbers, the models.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://alfonsofortunato.com/blog/gemma-4-e4b-vs-26b-local-benchmarks/gemma-4-logo.png"
alt="Official Gemma 4 visual from the Google DeepMind model page"&gt;&lt;figcaption&gt;
&lt;p&gt;Gemma 4, as presented on the official Google DeepMind model page.&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;I pulled the high-level specs from the Unsloth model cards because raw benchmark data without architecture context is half a story. Maybe less.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Gemma 4 E4B&lt;/th&gt;
&lt;th&gt;Gemma 4 26B A4B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;Dense&lt;/td&gt;
&lt;td&gt;Mixture-of-Experts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Effective or active parameters&lt;/td&gt;
&lt;td&gt;4.5B effective&lt;/td&gt;
&lt;td&gt;3.8B active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total parameters&lt;/td&gt;
&lt;td&gt;8B with embeddings&lt;/td&gt;
&lt;td&gt;25.2B total&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layers&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sliding window&lt;/td&gt;
&lt;td&gt;512 tokens&lt;/td&gt;
&lt;td&gt;1024 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context length&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modalities&lt;/td&gt;
&lt;td&gt;Text, Image, Audio&lt;/td&gt;
&lt;td&gt;Text, Image&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vision encoder&lt;/td&gt;
&lt;td&gt;~150M&lt;/td&gt;
&lt;td&gt;~550M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audio encoder&lt;/td&gt;
&lt;td&gt;~300M&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That already makes the choice more interesting than a lazy &amp;ldquo;small model vs big model&amp;rdquo; comparison.&lt;/p&gt;
&lt;p&gt;The E4B is the compact, dense model. It is clearly designed for local and lighter deployments, and it even includes audio support. The 26B A4B is a different beast: a larger MoE model, much longer context, bigger multimodal stack, and the kind of profile that suggests better ceiling, not better speed.&lt;/p&gt;
&lt;p&gt;That distinction matters. A lot.&lt;/p&gt;
&lt;h2 id="the-first-benchmark-pass"&gt;The first benchmark pass&lt;/h2&gt;
&lt;p&gt;Here are the initial results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th style="text-align: right"&gt;Size&lt;/th&gt;
&lt;th style="text-align: right"&gt;GPU layers&lt;/th&gt;
&lt;th style="text-align: right"&gt;&lt;code&gt;pp512&lt;/code&gt; t/s&lt;/th&gt;
&lt;th style="text-align: right"&gt;&lt;code&gt;tg128&lt;/code&gt; t/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 E4B&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
&lt;td style="text-align: right"&gt;7.62 GiB&lt;/td&gt;
&lt;td style="text-align: right"&gt;33&lt;/td&gt;
&lt;td style="text-align: right"&gt;3157.39 ± 332.12&lt;/td&gt;
&lt;td style="text-align: right"&gt;27.86 ± 0.49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 26B A4B&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UD-Q4_K_XL&lt;/code&gt;&lt;/td&gt;
&lt;td style="text-align: right"&gt;15.95 GiB&lt;/td&gt;
&lt;td style="text-align: right"&gt;999&lt;/td&gt;
&lt;td style="text-align: right"&gt;332.80 ± 12.20&lt;/td&gt;
&lt;td style="text-align: right"&gt;13.70 ± 0.19&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;figure&gt;&lt;img src="https://alfonsofortunato.com/blog/gemma-4-e4b-vs-26b-local-benchmarks/local-bench.png"
alt="Local benchmark screenshot comparing Gemma 4 runs"&gt;&lt;figcaption&gt;
&lt;p&gt;The first local benchmark pass on the RTX 4070 Ti.&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The gap was not subtle.&lt;/p&gt;
&lt;p&gt;On prompt processing, E4B was about &lt;code&gt;9.49x&lt;/code&gt; faster. On generation, it was about &lt;code&gt;2.03x&lt;/code&gt; faster. Even before digging deeper, that already told me something practical: on this hardware, the smaller model is dramatically easier to live with.&lt;/p&gt;
&lt;p&gt;And that matters more than people admit.&lt;/p&gt;
&lt;p&gt;A local model that answers quickly changes how often you use it. It becomes something you reach for between tasks, during debugging, while sketching an idea, or when you want a fast second opinion in the terminal. A slower model might be more capable in some scenarios, but if it drags every interaction down, it stops being a tool and starts becoming an event.&lt;/p&gt;
&lt;h2 id="the-trap-i-hit-with-gpu-layers"&gt;The trap I hit with GPU layers&lt;/h2&gt;
&lt;p&gt;This is where the benchmark got more useful.&lt;/p&gt;
&lt;p&gt;I wanted to rerun E4B with &lt;code&gt;GPU_LAYERS=999&lt;/code&gt; to see how much it benefited from full offload. The script looked ready for it. It had &lt;code&gt;GPU_LAYERS&lt;/code&gt; support at the top, so I expected the override to work.&lt;/p&gt;
&lt;p&gt;It didn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;The wrapper was hardcoded to run &lt;code&gt;bench-e4b&lt;/code&gt; with &lt;code&gt;33&lt;/code&gt; GPU layers, which meant my first E4B benchmark understated what the model could actually do on this GPU. That kind of thing is exactly why I like testing locally instead of trusting my assumptions. One overlooked line in a shell script can invalidate the conclusion you thought you were making.&lt;/p&gt;
&lt;p&gt;So I ran &lt;code&gt;llama-bench&lt;/code&gt; directly with &lt;code&gt;-ngl 999&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="e4b-at-33-layers-vs-999-layers"&gt;E4B at 33 layers vs 999 layers&lt;/h2&gt;
&lt;p&gt;Here is the comparison that changed the post.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;E4B run&lt;/th&gt;
&lt;th style="text-align: right"&gt;&lt;code&gt;ngl&lt;/code&gt;&lt;/th&gt;
&lt;th style="text-align: right"&gt;&lt;code&gt;pp512&lt;/code&gt; t/s&lt;/th&gt;
&lt;th style="text-align: right"&gt;&lt;code&gt;tg128&lt;/code&gt; t/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Script default&lt;/td&gt;
&lt;td style="text-align: right"&gt;33&lt;/td&gt;
&lt;td style="text-align: right"&gt;3157.39 ± 332.12&lt;/td&gt;
&lt;td style="text-align: right"&gt;27.86 ± 0.49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direct full offload test&lt;/td&gt;
&lt;td style="text-align: right"&gt;999&lt;/td&gt;
&lt;td style="text-align: right"&gt;6757.30 ± 1777.85&lt;/td&gt;
&lt;td style="text-align: right"&gt;69.66 ± 1.30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That is not a rounding error. That is a different user experience.&lt;/p&gt;
&lt;p&gt;Moving E4B from &lt;code&gt;ngl=33&lt;/code&gt; to &lt;code&gt;ngl=999&lt;/code&gt; improved prompt processing by about &lt;code&gt;2.14x&lt;/code&gt; and generation by about &lt;code&gt;2.50x&lt;/code&gt;. In plain terms: once fully offloaded, E4B stopped looking merely &amp;ldquo;good for a small model&amp;rdquo; and started looking genuinely pleasant to use.&lt;/p&gt;
&lt;p&gt;This also sharpens the comparison with 26B.&lt;/p&gt;
&lt;p&gt;If I compare the best local E4B run against the 26B run on the same machine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;E4B is about &lt;code&gt;20.3x&lt;/code&gt; faster on prompt processing&lt;/li&gt;
&lt;li&gt;E4B is about &lt;code&gt;5.1x&lt;/code&gt; faster on generation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That does not prove E4B is the better model in every sense. It proves something more useful: for this class of hardware, full-offload E4B sits in a very attractive spot between speed, model size, and local usability.&lt;/p&gt;
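&lt;p&gt;For completeness, both the offload gain and the cross-model gap reduce to simple ratios of the measured throughputs:&lt;/p&gt;

```python
# E4B at ngl=33 vs ngl=999, and best E4B vs 26B, from the tables above.
e4b_33 = (3157.39, 27.86)     # (pp512, tg128) at 33 GPU layers
e4b_999 = (6757.30, 69.66)    # (pp512, tg128) fully offloaded
b26 = (332.80, 13.70)         # 26B A4B run

offload_pp = e4b_999[0] / e4b_33[0]   # offload gain, prompt processing
offload_tg = e4b_999[1] / e4b_33[1]   # offload gain, generation
vs26_pp = e4b_999[0] / b26[0]         # best E4B vs 26B, prompt processing
vs26_tg = e4b_999[1] / b26[1]         # best E4B vs 26B, generation
print(offload_pp, offload_tg, vs26_pp, vs26_tg)
```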
&lt;h2 id="a-more-realistic-long-context-benchmark"&gt;A more realistic long-context benchmark&lt;/h2&gt;
&lt;p&gt;The original &lt;code&gt;pp512&lt;/code&gt; and &lt;code&gt;tg128&lt;/code&gt; numbers are useful for comparing raw throughput, but they are still synthetic. Real local usage looks more like pasted documentation, longer chats, repo summaries, and prompts big enough to put pressure on the KV cache.&lt;/p&gt;
&lt;p&gt;So I ran a second pass with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pp4096&lt;/code&gt; + &lt;code&gt;tg256&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pp8192&lt;/code&gt; + &lt;code&gt;tg256&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pp16384&lt;/code&gt; + &lt;code&gt;tg256&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here is what happened.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th style="text-align: right"&gt;Prompt size&lt;/th&gt;
&lt;th style="text-align: right"&gt;Prompt processing t/s&lt;/th&gt;
&lt;th style="text-align: right"&gt;Generation t/s&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;E4B&lt;/td&gt;
&lt;td style="text-align: right"&gt;4096&lt;/td&gt;
&lt;td style="text-align: right"&gt;7117.83 ± 7.33&lt;/td&gt;
&lt;td style="text-align: right"&gt;70.85 ± 0.06&lt;/td&gt;
&lt;td&gt;completed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E4B&lt;/td&gt;
&lt;td style="text-align: right"&gt;8192&lt;/td&gt;
&lt;td style="text-align: right"&gt;6720.76 ± 14.76&lt;/td&gt;
&lt;td style="text-align: right"&gt;70.83 ± 0.20&lt;/td&gt;
&lt;td&gt;completed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E4B&lt;/td&gt;
&lt;td style="text-align: right"&gt;16384&lt;/td&gt;
&lt;td style="text-align: right"&gt;5992.66 ± 4.84&lt;/td&gt;
&lt;td style="text-align: right"&gt;70.84 ± 0.10&lt;/td&gt;
&lt;td&gt;completed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;26B A4B&lt;/td&gt;
&lt;td style="text-align: right"&gt;4096&lt;/td&gt;
&lt;td style="text-align: right"&gt;323.74 ± 0.67&lt;/td&gt;
&lt;td style="text-align: right"&gt;14.88 ± 0.02&lt;/td&gt;
&lt;td&gt;completed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;26B A4B&lt;/td&gt;
&lt;td style="text-align: right"&gt;8192&lt;/td&gt;
&lt;td style="text-align: right"&gt;293.09 ± 0.88&lt;/td&gt;
&lt;td style="text-align: right"&gt;15.07 ± 0.18&lt;/td&gt;
&lt;td&gt;completed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;26B A4B&lt;/td&gt;
&lt;td style="text-align: right"&gt;16384&lt;/td&gt;
&lt;td style="text-align: right"&gt;268.06 ± 3.03&lt;/td&gt;
&lt;td style="text-align: right"&gt;not completed&lt;/td&gt;
&lt;td&gt;not practically runnable to full completion on this setup&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is the part I care about most.&lt;/p&gt;
&lt;p&gt;E4B degraded gracefully as prompt size increased. Prompt ingestion dropped from roughly &lt;code&gt;7118 t/s&lt;/code&gt; at &lt;code&gt;4K&lt;/code&gt; to &lt;code&gt;5993 t/s&lt;/code&gt; at &lt;code&gt;16K&lt;/code&gt;, but generation stayed essentially flat at around &lt;code&gt;70.8 t/s&lt;/code&gt;. That is exactly the kind of behavior you want from a local model that is supposed to stay usable as context grows.&lt;/p&gt;
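&lt;p&gt;The degradation is easy to quantify from the table: prompt ingestion lost a modest fraction going from 4K to 16K prompts, while E4B generation barely moved.&lt;/p&gt;

```python
# Prompt-processing drop from 4K to 16K prompts, per the table above.
e4b_drop = (1 - 5992.66 / 7117.83) * 100   # E4B
b26_drop = (1 - 268.06 / 323.74) * 100     # 26B A4B (prompt side only)
print(f"E4B: -{e4b_drop:.1f}%, 26B: -{b26_drop:.1f}%")
```

&lt;p&gt;A roughly 16% ingestion slowdown over a 4x prompt increase is a curve you can live with; the 26B run pays a similar relative cost on top of a much lower baseline.&lt;/p&gt;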
&lt;p&gt;The 26B model told a harsher story. &lt;code&gt;4K&lt;/code&gt; and &lt;code&gt;8K&lt;/code&gt; completed, but much more slowly, and the &lt;code&gt;16K&lt;/code&gt; run produced a prompt-processing number while generation was too slow to be practical for this workflow. That does not mean the model is broken. It means the combination of model size, context pressure, and local hardware constraints crosses the line from &amp;ldquo;interesting benchmark&amp;rdquo; into &amp;ldquo;not something I would actually want to sit through.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;One important caveat: this was a long-context benchmark story, not a universal 26B verdict. In a fresh short-context &lt;code&gt;llama-cli&lt;/code&gt; run, the 26B model was much healthier, with roughly &lt;code&gt;395.5 t/s&lt;/code&gt; on prompt ingestion and about &lt;code&gt;34.0 t/s&lt;/code&gt; on generation. That is a perfectly usable local result. So the real distinction is not &amp;ldquo;26B is always slow.&amp;rdquo; It is &amp;ldquo;26B is fine on short context, then becomes much harder to trust once the context window starts growing.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;That is the real takeaway from the long-context pass: advertised context windows are one thing, runnable local context is another.&lt;/p&gt;
&lt;h2 id="how-to-run-this-locally"&gt;How to run this locally&lt;/h2&gt;
&lt;p&gt;This is the part I always want in posts like this and usually don&amp;rsquo;t get: the exact commands.&lt;/p&gt;
&lt;p&gt;Assumptions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;llama.cpp&lt;/code&gt; is built in &lt;code&gt;~/llama.cpp&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;you have enough VRAM to fully offload E4B&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="benchmark-commands"&gt;Benchmark commands&lt;/h3&gt;
&lt;p&gt;Run the E4B benchmark with full offload:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nv"&gt;LLAMA_CACHE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;~/ai-experimenting/gemma4/models/gemma4_e4b_q8_0 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;~/llama.cpp/build/bin/llama-bench &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -ngl &lt;span class="m"&gt;999&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -t &lt;span class="m"&gt;16&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -p &lt;span class="m"&gt;512&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -n &lt;span class="m"&gt;128&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -d &lt;span class="m"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Run the 26B benchmark:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;~/llama.cpp/build/bin/llama-bench &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -m ~/unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -ngl &lt;span class="m"&gt;999&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -t &lt;span class="m"&gt;16&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -p &lt;span class="m"&gt;512&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -n &lt;span class="m"&gt;128&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -d &lt;span class="m"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="interactive-chat-commands"&gt;Interactive chat commands&lt;/h3&gt;
&lt;p&gt;Chat with E4B:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;~/llama.cpp/build/bin/llama-cli &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -ngl &lt;span class="m"&gt;999&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -t &lt;span class="m"&gt;16&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -c &lt;span class="m"&gt;65536&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --temp 1.0 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --top-p 0.95 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --top-k &lt;span class="m"&gt;64&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Chat with 26B:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;~/llama.cpp/build/bin/llama-cli &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -m ~/unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --mmproj ~/unsloth/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -ngl &lt;span class="m"&gt;999&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -t &lt;span class="m"&gt;16&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -c &lt;span class="m"&gt;65536&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --temp 1.0 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --top-p 0.95 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --top-k &lt;span class="m"&gt;64&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Serve 26B with &lt;code&gt;llama-server&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;./llama.cpp/llama-server &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -m /home/moviemaker/unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --mmproj /home/moviemaker/unsloth/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -t &lt;span class="m"&gt;16&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -c &lt;span class="m"&gt;65536&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --temp 1.0 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --top-p 0.95 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --top-k &lt;span class="m"&gt;64&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --alias &lt;span class="s2"&gt;&amp;#34;unsloth/gemma-4-26B-A4B-it-GGUF&amp;#34;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --port &lt;span class="m"&gt;8001&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --chat-template-kwargs &lt;span class="s1"&gt;&amp;#39;{&amp;#34;enable_thinking&amp;#34;:true}&amp;#39;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --jinja
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;On my current setup, this worked better than forcing &lt;code&gt;-ngl&lt;/code&gt;. The 26B run was simply more reliable once I stopped trying to hardcode GPU layers.&lt;/p&gt;
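&lt;p&gt;Once the server is up, any OpenAI-style client can talk to it. This is a minimal sketch of the request shape against the command above, using port &lt;code&gt;8001&lt;/code&gt; and the &lt;code&gt;--alias&lt;/code&gt; value as the model name; the prompt itself is just a placeholder.&lt;/p&gt;

```python
import json

# Request body for llama-server's OpenAI-compatible endpoint:
#   POST http://127.0.0.1:8001/v1/chat/completions
# "model" must match the --alias passed to llama-server.
payload = {
    "model": "unsloth/gemma-4-26B-A4B-it-GGUF",
    "messages": [
        {"role": "user", "content": "Summarize this repo in two sentences."}
    ],
    "max_tokens": 256,
    "stream": False,
}

body = json.dumps(payload)
print(body)
```

&lt;p&gt;Send it with &lt;code&gt;curl -d&lt;/code&gt; or any OpenAI-compatible client pointed at &lt;code&gt;http://127.0.0.1:8001/v1&lt;/code&gt;; the response follows the usual chat-completion shape.&lt;/p&gt;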
&lt;figure&gt;&lt;img src="https://alfonsofortunato.com/blog/gemma-4-e4b-vs-26b-local-benchmarks/llama-ccp-example-chat.png"
alt="Example of chatting locally with Gemma 4 E4B through llama.cpp"&gt;&lt;figcaption&gt;
&lt;p&gt;An example local chat session with Gemma 4 E4B running through llama.cpp.&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;If you are tempted to increase &lt;code&gt;-c&lt;/code&gt; aggressively beyond &lt;code&gt;64K&lt;/code&gt;, do it one step at a time. On paper, long context looks like a free upgrade. Locally, it is usually where KV cache pressure starts dictating the experience.&lt;/p&gt;
&lt;h2 id="how-to-wire-it-into-opencode"&gt;How to wire it into opencode&lt;/h2&gt;
&lt;p&gt;This one is straightforward.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;llama-server&lt;/code&gt; exposes an OpenAI-compatible API on &lt;code&gt;/v1&lt;/code&gt;, and &lt;code&gt;opencode&lt;/code&gt; already supports OpenAI-compatible providers. In fact, this is very close to the config I already have on my machine:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;$schema&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;https://opencode.ai/config.json&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;provider&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;local-llama&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;npm&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;@ai-sdk/openai-compatible&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;name&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Local Llama&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;options&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;baseURL&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;http://127.0.0.1:8081/v1&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;models&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;gemma-4-e4b&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;name&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Gemma 4 E4B&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;maxContext&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;64000&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Save that in:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;~/.config/opencode/config.json
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then start your server first:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;~/llama.cpp/build/bin/llama-server &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -ngl &lt;span class="m"&gt;999&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -t &lt;span class="m"&gt;16&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -c &lt;span class="m"&gt;65536&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --jinja &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --temp 1.0 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --top-p 0.95 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --top-k &lt;span class="m"&gt;64&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --alias &lt;span class="s2"&gt;&amp;#34;unsloth/gemma-4-E4B-it-GGUF&amp;#34;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --port &lt;span class="m"&gt;8081&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --chat-template-kwargs &lt;span class="s1"&gt;&amp;#39;{&amp;#34;enable_thinking&amp;#34;:true}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;And launch &lt;code&gt;opencode&lt;/code&gt; in another terminal:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;opencode
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The important bit is the &lt;code&gt;baseURL&lt;/code&gt;. Once &lt;code&gt;llama-server&lt;/code&gt; is up, &lt;code&gt;opencode&lt;/code&gt; can treat it like any other OpenAI-compatible backend.&lt;/p&gt;
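&lt;p&gt;If you want to sanity-check that path before involving any client, you can hit the chat endpoint directly. The sketch below is mine, not part of the original setup: it only builds and prints the request as a &lt;code&gt;curl&lt;/code&gt; command, reusing the port and alias from the server command above.&lt;/p&gt;

```python
import json

# OpenAI-style request body for llama-server's /v1/chat/completions endpoint.
# The model string mirrors the --alias flag from the server command above;
# for a single-model llama-server it mostly serves as a label.
payload = {
    "model": "unsloth/gemma-4-E4B-it-GGUF",
    "messages": [{"role": "user", "content": "Reply with one word: pong"}],
    "max_tokens": 16,
}

body = json.dumps(payload)

# Printed rather than sent, so this snippet runs even when the server is down.
print("curl -s http://127.0.0.1:8081/v1/chat/completions \\")
print("  -H 'Content-Type: application/json' \\")
print(f"  -d '{body}'")
```

&lt;p&gt;If the server answers that request with a normal &lt;code&gt;choices&lt;/code&gt; array, the &lt;code&gt;opencode&lt;/code&gt; wiring has nothing exotic left to do.&lt;/p&gt;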
&lt;h2 id="a-more-realistic-64k-test-prompt"&gt;A more realistic 64K test prompt&lt;/h2&gt;
&lt;p&gt;Benchmarks are useful, but they only get you so far.&lt;/p&gt;
&lt;p&gt;The more honest test is a real task with a large context budget and a visible output. I ran this through &lt;code&gt;opencode&lt;/code&gt; against the local E4B server, so it makes more sense to show the wiring first and the prompt second. For E4B, this is the &lt;code&gt;64K&lt;/code&gt; setup I used:&lt;/p&gt;
&lt;p&gt;First, download and serve the model locally with thinking enabled:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;hf download unsloth/gemma-4-E4B-it-GGUF &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --local-dir ~/unsloth/gemma-4-E4B-it-GGUF &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --include &lt;span class="s2"&gt;&amp;#34;gemma-4-E4B-it-Q8_0.gguf&amp;#34;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --include &lt;span class="s2"&gt;&amp;#34;mmproj-BF16.gguf&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;./llama.cpp/llama-server &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -m unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-Q8_0.gguf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --mmproj unsloth/gemma-4-E4B-it-GGUF/mmproj-BF16.gguf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --temp 1.0 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --top-p 0.95 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --top-k &lt;span class="m"&gt;64&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --alias &lt;span class="s2"&gt;&amp;#34;unsloth/gemma-4-E4B-it-GGUF&amp;#34;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --port &lt;span class="m"&gt;8001&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --chat-template-kwargs &lt;span class="s1"&gt;&amp;#39;{&amp;#34;enable_thinking&amp;#34;:true}&amp;#39;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --jinja
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For the 26B comparison run, I used the same general setup pattern with the larger model:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;hf download unsloth/gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --local-dir ~/unsloth/gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --include &lt;span class="s2"&gt;&amp;#34;gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf&amp;#34;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --include &lt;span class="s2"&gt;&amp;#34;mmproj-BF16.gguf&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;./llama.cpp/llama-server &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -m unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --mmproj unsloth/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --temp 1.0 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --top-p 0.95 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --top-k &lt;span class="m"&gt;64&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --alias &lt;span class="s2"&gt;&amp;#34;unsloth/gemma-4-26B-A4B-it-GGUF&amp;#34;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --port &lt;span class="m"&gt;8001&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --chat-template-kwargs &lt;span class="s1"&gt;&amp;#39;{&amp;#34;enable_thinking&amp;#34;:true}&amp;#39;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --jinja
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here is the prompt itself, sent identically to both models:&lt;/p&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Create a production-ready landing page web app for a futuristic architecture studio called MONOLITH.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Requirements:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Use a premium editorial visual style, not a generic startup template.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- The design should feel cinematic, expensive, and highly art-directed.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Use strong typography, layered backgrounds, elegant spacing, and a memorable layout.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Include subtle but polished animations and microinteractions.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Make it fully responsive for desktop, tablet, and mobile.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Include these sections:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; - Hero
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; - Studio manifesto
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; - Featured projects
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; - Design process
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; - Testimonials
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; - Contact / call to action
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Add interactive elements such as hover states, animated cards, smooth scrolling, or section reveals.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- The page should feel custom-designed for an architecture brand, with references to space, material, light, geometry, and form.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Avoid placeholder-looking UI.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Avoid generic gradients and generic SaaS patterns.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Use a clear design system with reusable spacing, colors, and type styles.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Technical requirements:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Return production-ready code.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Build the whole app in a single self-contained file if possible.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Use clean, maintainable structure.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Do not include reasoning, hidden thinking, or explanation.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Output only the final code.
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;I like this prompt because it is not a toy. It asks for structure, visual taste, motion, responsiveness, and enough implementation detail that weaker local models usually collapse into something generic.&lt;/p&gt;
&lt;p&gt;I also learned something useful from this test. Both webapp runs were done with thinking enabled. That made the comparison fair, but it also reminded me how quickly the context budget can disappear once you are asking a local model to reason and generate at the same time. It is one more reason I care more about practical local RAG and grounded summarization than about pushing these models into elaborate front-end generation loops.&lt;/p&gt;
&lt;p&gt;It also reflects how I would actually use a local model: not just for chat, but for something substantial enough to show whether the latency and context settings still feel acceptable.&lt;/p&gt;
&lt;h2 id="what-the-two-models-actually-produced"&gt;What the two models actually produced&lt;/h2&gt;
&lt;p&gt;First, the &lt;code&gt;MONOLITH&lt;/code&gt; prompt generated with Gemma 4 E4B.&lt;/p&gt;
&lt;video controls playsinline preload="metadata" width="100%"&gt;
&lt;source src="monolith-webapp-demo.mp4" type="video/mp4"&gt;
&lt;/video&gt;
&lt;p&gt;Then the same prompt again, this time with Gemma 4 26B A4B on the same local setup.&lt;/p&gt;
&lt;video controls playsinline preload="metadata" width="100%"&gt;
&lt;source src="monolith-webapp-demo-26b.mp4" type="video/mp4"&gt;
&lt;/video&gt;
&lt;p&gt;This is where the tables stop helping and the result matters. On this prompt, the 26B output is plainly better. E4B is still the faster and more practical local default, but the bigger model produced the stronger webapp result on my current setup.&lt;/p&gt;
&lt;p&gt;The E4B run also exposed a real limitation in my current setup. The image path was broken there, so I would not use that E4B setup as-is for production front-end generation that depends on visual inputs. That did not decide the whole comparison, but it did push me back toward the use case I care about more anyway: local RAG and agentic tool use, where the hard part is grounding, retrieval, and execution.&lt;/p&gt;
&lt;p&gt;There was also a more ordinary tooling weakness in the E4B run. Even with edit permissions already allowed, I still had to tell it multiple times to actually write files. That is the kind of friction that matters in practice. A local model does not need to be perfect, but it does need to stop making you repeat obvious instructions.&lt;/p&gt;
&lt;h2 id="how-to-wire-it-into-claude-code"&gt;How to wire it into Claude Code&lt;/h2&gt;
&lt;p&gt;This one is more experimental, so I want to be precise.&lt;/p&gt;
&lt;p&gt;According to the &lt;code&gt;llama.cpp&lt;/code&gt; server docs, &lt;code&gt;llama-server&lt;/code&gt; exposes an Anthropic-compatible &lt;code&gt;POST /v1/messages&lt;/code&gt; endpoint. According to Anthropic&amp;rsquo;s Claude Code gateway docs, Claude Code can be pointed at a custom Anthropic-style base URL through &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt;. That means you can try routing Claude Code through your local &lt;code&gt;llama-server&lt;/code&gt;.&lt;/p&gt;
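&lt;p&gt;To see what that endpoint expects, here is a sketch (mine, not from either doc) of the Anthropic-style request shape. The &lt;code&gt;max_tokens&lt;/code&gt; field is required by the Messages API shape, and the headers are what Anthropic clients normally send; for a local server the key value is arbitrary.&lt;/p&gt;

```python
import json

# Anthropic-style request body for llama-server's POST /v1/messages.
# "gemma-4-e4b" assumes the --alias used in the server command below.
payload = {
    "model": "gemma-4-e4b",
    "max_tokens": 128,  # required by the Anthropic Messages API shape
    "messages": [{"role": "user", "content": "Reply with one word: pong"}],
}

body = json.dumps(payload)

# Printed rather than sent, so the snippet runs without the server.
print("curl -s http://127.0.0.1:8081/v1/messages \\")
print("  -H 'content-type: application/json' \\")
print("  -H 'x-api-key: local-llama' \\")
print("  -H 'anthropic-version: 2023-06-01' \\")
print(f"  -d '{body}'")
```

&lt;p&gt;If that returns a sensible Anthropic-shaped response, the Claude Code experiment has a fighting chance.&lt;/p&gt;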
&lt;p&gt;The practical setup looks like this:&lt;/p&gt;
&lt;p&gt;Start the server with &lt;code&gt;--jinja&lt;/code&gt; enabled:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nv"&gt;LLAMA_CACHE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;~/ai-experimenting/gemma4/models/gemma4_e4b_q8_0 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;~/llama.cpp/build/bin/llama-server &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --alias gemma-4-e4b &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -ngl &lt;span class="m"&gt;999&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -t &lt;span class="m"&gt;16&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -c &lt;span class="m"&gt;65536&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --jinja &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --port &lt;span class="m"&gt;8081&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then in another shell:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://127.0.0.1:8081
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;local-llama
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nv"&gt;ANTHROPIC_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemma-4-e4b
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;claude --dangerously-skip-permissions
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Or for a one-shot prompt:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://127.0.0.1:8081 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;local-llama &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nv"&gt;ANTHROPIC_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemma-4-e4b &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;claude -p &lt;span class="s2"&gt;&amp;#34;Summarize this repository and tell me where the tests live&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Two caveats.&lt;/p&gt;
&lt;p&gt;First, this is an inference built from two compatibility layers, not a setup Anthropic explicitly documents for &lt;code&gt;llama.cpp&lt;/code&gt;. Claude Code officially documents custom Anthropic-style gateways. &lt;code&gt;llama.cpp&lt;/code&gt; documents an Anthropic-compatible messages endpoint that &amp;ldquo;suffices to support many apps.&amp;rdquo; Those two facts make the experiment reasonable, but not guaranteed.&lt;/p&gt;
&lt;p&gt;Second, I would expect basic prompting to work more reliably than full agentic tool loops. If your goal is local coding assistance with minimal fuss, &lt;code&gt;opencode&lt;/code&gt; is the cleaner fit today because its configuration is explicitly OpenAI-compatible and &lt;code&gt;llama-server&lt;/code&gt; exposes that path very cleanly.&lt;/p&gt;
&lt;h2 id="so-which-one-would-i-actually-use"&gt;So which one would I actually use?&lt;/h2&gt;
&lt;p&gt;For day-to-day local work on a 4070 Ti, I would still pick E4B first.&lt;/p&gt;
&lt;p&gt;Not because the 26B model is bad. It isn&amp;rsquo;t. The 26B A4B has the bigger context window, the bigger visual stack, and the kind of architecture that may well reward you in harder tasks where answer quality matters more than latency. If I were evaluating deeper reasoning, longer-context retrieval, or multimodal tasks where the extra capacity might show up clearly, I would still keep it around.&lt;/p&gt;
&lt;p&gt;And to be fair to the bigger model, it also needed less steering. The 26B run usually picked up the intended direction with fewer follow-up corrections, even if the overall interaction stayed slower. The tooling path still was not ideal there either.&lt;/p&gt;
&lt;p&gt;But most local workflows are not that romantic. They are messy. Repetitive. Interrupt-driven. You ask a question, try a prompt, adjust, rerun, compare, move on. Speed wins those loops.&lt;/p&gt;
&lt;p&gt;That is where E4B still makes sense.&lt;/p&gt;
&lt;p&gt;It is small enough to fit comfortably. Fast enough to feel responsive. And once I tested it with full offload, it became obvious that I had very nearly underrated the model.&lt;/p&gt;
&lt;p&gt;That is a very normal local-LLM mistake. One bad setting and you end up benchmarking your wrapper instead of the model.&lt;/p&gt;
&lt;h2 id="what-i-took-away-from-this"&gt;What I took away from this&lt;/h2&gt;
&lt;p&gt;Three things.&lt;/p&gt;
&lt;p&gt;First, benchmark the thing you are actually running, not the thing you think you configured. In my case, one hardcoded shell argument changed the story.&lt;/p&gt;
&lt;p&gt;Second, local model choice is not just about parameter count or architecture prestige. It is about interaction quality. If a model is fast enough to stay in your loop, you will learn its strengths. If it is slow enough to annoy you, you will quietly stop using it.&lt;/p&gt;
&lt;p&gt;Third, E4B is the one I would recommend first for this GPU tier if your real goal looks like mine: lightweight local RAG, fetching information from the web, summarizing it well, and handling smaller tool-using tasks without turning every query into an expensive hosted-model decision.&lt;/p&gt;
&lt;p&gt;The next steps are a local RAG test and then an OpenClaw run. After seeing the image path break in the webapp experiment, that feels like the more useful question anyway: can E4B retrieve the right context, summarize it cleanly, stay grounded once web fetching is involved, and still hold up once tool use, state, and longer agentic loops enter the picture?&lt;/p&gt;
&lt;p&gt;Give the 26B model a shot if you are exploring quality ceilings, longer context, or multimodal work where the larger stack may matter. But if your priority is a local assistant that feels snappy in &lt;code&gt;llama.cpp&lt;/code&gt; and is mainly there to help with RAG, web retrieval, and summarization, E4B is where I would start.&lt;/p&gt;
&lt;p&gt;At least now the next benchmark should be measuring the model, not my mistake.&lt;/p&gt;
&lt;h2 id="sources"&gt;Sources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>