It has been almost 10 days since Part 1 and I’ve learned a lot, tweaked and benchmarked more times than I can recall, and used Qwen3.6 in as many contexts as I could in my daily dev life.

The Quality

I know and trust Qwen3.6-35B-A3B at this point. I’ve used it for:

  • coding features in my current project (Go)
  • implementing and expanding test suites for multiple projects
  • writing build/CI configs for multiple projects to run on my Forgejo server
  • reviewing code as I’m writing it
  • implementing my own MCP tools to run on my server, also written in Go
  • writing terminal tools/scripts that I now use daily

I’ve been mostly using it through opencode, and sometimes pi.

To reach this level of trust, my stress test was using Qwen3.6-35B-A3B to vibe code a fantasy console that’s NES-like and write two games for it (Python/pygame). This task got close to 230k tokens of context (out of a 262k maximum).

There were difficulties beyond ~190k context. Game development is very difficult for an LLM because it can’t see or directly test the result, and gameplay is partially a subjective experience anyway. Someone out there has probably managed to set up vision and screenshots to make this a little easier for the LLM, but I haven’t yet. The overall result was satisfactory in my opinion.

I also tested pushing the context to 512k and 1M via RoPE/YaRN scaling, quantizing the KV cache with turbo quant (turbo4), and using the smallest quality quant (IQ4_NL_XL). The main problem wasn’t the impact on quality, but how much slower generation got (<5 t/s). This should be fine for long autonomous tasks, but normal cooperative development would be painful.
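To make the scaling numbers concrete, here is a minimal sketch of the arithmetic behind stretching a context window. It assumes that “262k” means 262,144 tokens and that 512k and 1M mean 524,288 and 1,048,576 tokens; the function name is my own and not part of any tool.

```python
NATIVE_CTX = 262_144  # assumption: "262k" means 256 * 1024 tokens


def rope_scale_factor(target_ctx: int, native_ctx: int = NATIVE_CTX) -> float:
    """Return how far the positional range must be stretched to reach target_ctx."""
    if target_ctx <= 0 or native_ctx <= 0:
        raise ValueError("context sizes must be positive")
    # No scaling is needed at or below the native window.
    return max(1.0, target_ctx / native_ctx)


if __name__ == "__main__":
    for label, ctx in [("512k", 512 * 1024), ("1M", 1024 * 1024)]:
        print(f"{label}: {ctx} tokens -> scale factor {rope_scale_factor(ctx):g}")
```

Under those assumptions, 512k is a 2x stretch and 1M is a 4x stretch of the native window. How that factor is passed to the runtime depends on the inference engine in use.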

The Quantization

of Model

I got a reply from @landelare suggesting I try the APEX quant. I’d seen the acronym somewhere but hadn’t looked into it. Later on I was doing some reading and decided to take a look at the technical report for APEX, and it resonated with my developer mind very loudly. It’s basically a quant that achieves a double optimization, higher quality and smaller size, simply by using different quants for different layers according to their sensitivity to error. Oh, and it’s specifically made for MoE models!
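The general idea of spending bits where errors hurt most can be shown with a toy example. This is not APEX’s actual algorithm, just a greedy sketch of sensitivity-based bit allocation; the tensor names, sizes, sensitivity scores, and budget are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    params: int         # number of weights (made-up values below)
    sensitivity: float  # hypothetical score: higher = quantization error hurts more


def assign_bits(tensors, budget_bits, low=4, high=8):
    """Greedy toy: start every tensor at `low` bits, then upgrade the most
    sensitive tensors to `high` bits while the total stays within budget."""
    plan = {t.name: low for t in tensors}
    used = sum(t.params * low for t in tensors)
    for t in sorted(tensors, key=lambda t: t.sensitivity, reverse=True):
        extra = t.params * (high - low)
        if used + extra <= budget_bits:
            plan[t.name] = high
            used += extra
    return plan, used


if __name__ == "__main__":
    tensors = [
        Tensor("attention", 1_000, 0.9),
        Tensor("shared_expert", 2_000, 0.7),
        Tensor("routed_experts", 30_000, 0.2),
    ]
    plan, used = assign_bits(tensors, budget_bits=144_000)
    total_params = sum(t.params for t in tensors)
    print(plan, f"average bits per weight: {used / total_params:.2f}")
```

In an MoE model the routed experts hold most of the weights, so keeping only the small, sensitive tensors at high precision costs very little extra size. That is the intuition the toy is meant to show, nothing more.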

Some of the results shown in the paper are almost unreal. It shows that the Quality version of the quant is basically as good as Q6_K/Q8_0. Also its perplexity is very close to the reference FP16 model!

The two high configurations, Quality and Balanced, were a bit confusing. Quality is smaller than Balanced, but it appears to be better!

Either way, I headed to the quants created by the researcher. The APEX-I-Quality GGUF quant at 22.8GB was larger than the quant I was using at that point (Unsloth Q4_K_XL at 22.4GB), so I expected it to use more of my already exhausted memory.

Then I started testing APEX-I-Quality and somehow it uses less memory than Q4_K_XL?! I’m not sure why, or if I’m just imagining it, but I was able to push the context size further up to 512K tokens with APEX-I-Quality and my system still had 5GB of system RAM left. Doing the same with Q4_K_XL basically exhausted my entire system memory. So I decided to make the switch, and I’ve been happy with APEX-I-Quality since.

of KV Cache

I’ve seen some noise in discussion groups about a new KV quant that saves more memory than q8_0 while not losing much accuracy. I was interested in turbo4 specifically, as it meant I could possibly push the context to a stable 1M tokens! My max with q8_0 was around 640k with IQ4_NL_XL.

turbo4 was only available via a fork, so I got the fork and tested IQ4_NL_XL with context sizes of 512K, 768K, and 1M on a 100k+ token task. To my shock, it worked and was stable (no out-of-memory crash).

I added turbo4 to my profiles. If I ever need >512K tokens, it’ll be the one. But in my main profiles I use q8_0, as that’s safer than turbo quants.
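For a sense of why the KV cache format matters so much at these context sizes, here is a rough back-of-the-envelope sketch. The model shape is a placeholder, not Qwen3.6’s real architecture, and the 4-bit figure is an assumption standing in for turbo4, whose exact storage cost I haven’t verified. The 8.5 bits for q8_0 comes from its block layout (32 int8 values plus one 16-bit scale).

```python
def kv_cache_gib(ctx_tokens, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Estimate KV cache size in GiB for a standard attention stack."""
    # Keys and values are both cached, hence the factor of 2.
    total_bits = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bits_per_value
    return total_bits / 8 / 1024**3


if __name__ == "__main__":
    # Placeholder shape (assumption): 10 attention layers, 2 KV heads, head_dim 256.
    shape = dict(n_layers=10, n_kv_heads=2, head_dim=256)
    formats = {"f16": 16.0, "q8_0": 8.5, "4-bit (assumed)": 4.0}
    for ctx in (512 * 1024, 1024 * 1024):
        for name, bits in formats.items():
            size = kv_cache_gib(ctx, bits_per_value=bits, **shape)
            print(f"ctx={ctx:>9} {name:<16} {size:6.2f} GiB")
```

The cache grows linearly with context length and with bits per value, so halving the bits roughly doubles the context that fits in the same memory.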

The Unexpected

As I mentioned in Part 1, I wasn’t really interested in commercial AI. I never tried Claude, or GitHub Copilot, or any of the commercialised AI editors like Zed or Codex. Part of the reason is that I couldn’t afford to pay for them, as I haven’t had an income since last year (and wanted to spend some precious time with my parents, who I haven’t seen in years). The other part is that I’ve just been burned too many times by enshittification, with almost every commercial tool I used “enhancing” my experience (aside from Steam, still).

The unexpected result from Qwen3.6 is that it converted me into a potential customer for commercial AI. I have credit in my OpenRouter account and I’m considering subscribing to opencode go. A month ago I had absolutely no interest in either. The reason for the change is that having competent local AI means I can actually evolve my development processes significantly and trust that I own and control the main dependency without any compromises.

As of now, I use Qwen3.6 Plus or DeepSeek V4 to do the planning and complex tasks via OpenRouter. And I use my local LLMs as a pair programmer and for code completions. I let the LLMs write any trivial things or expand test suites, and I focus on the more critical things that impact the project’s path forward.

The Fourth DeepSeek

DeepSeek V4 actually deserves its own podium. The number of innovative optimizations and solutions they came up with is staggering. This video does an excellent job covering how the team got V4 to where it’s at:

It makes me fantasize about one day building an AI computer to run a DeepSeek V4 Flash successor at home. V4 Flash needs around 192GB of shared RAM to run at Q4_K_M quality, and it’s becoming normal for M5 Macs/Strix Halo PCs to pack 128GB of shared RAM, so we’re not that far off.

It can also probably run well on non-shared RAM systems with 192GB system RAM and 24GB of VRAM as it’s an MoE model with 13B active parameters.
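A rough way to reason about how fast such a setup could generate is to treat decoding as memory-bound: every new token has to read each active weight once. The sketch below applies that rule of thumb. The 4.8 bits per weight for Q4_K_M and the bandwidth figures are assumptions for illustration, and the estimate ignores KV cache reads, compute time, and the share of weights sitting in faster VRAM.

```python
def max_tokens_per_second(active_params, bits_per_weight, bandwidth_gb_s):
    """Rough upper bound: each generated token reads every active weight once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    active = 13e9  # 13B active parameters, as mentioned above
    bpw = 4.8      # assumption: approximate bits per weight for Q4_K_M
    # Hypothetical memory bandwidths in GB/s, not measurements of real hardware.
    for label, bw in [("slower system RAM", 80), ("faster shared RAM", 250)]:
        tps = max_tokens_per_second(active, bpw, bw)
        print(f"{label} ({bw} GB/s): at most ~{tps:.1f} t/s")
```

Because only the active parameters are read per token, an MoE model with 13B active parameters behaves more like a 13B dense model for speed, even though all of its weights must fit in memory.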

I hope the innovations from DeepSeek V4 will trickle down to future compact models.

The Future

Qwen3.5 was released on Feb 15, 2026, and I didn’t even know or care about it.

The Qwen3.6 models were released between April 1 and 22, 2026, and they have been the hottest local LLMs for me and everywhere else, such as r/LocalLLaMA, so as far as I’m concerned my conclusion is on point. Qwen3.6 is the first actually competent compact LLM that can run on average consumer PCs. And I and many others were hungry for this moment.

I also know for a fact that as good as 35B-A3B is, Qwen3.6-27B is even better! I just need a new GPU to run it. That’s gonna be at the top of my list once I have an income: a new GPU or one of the 128GB shared memory PCs (M5, Framework, etc). Those could run a future Qwen 122B-A10B model, for example, but from what I’ve seen, not as fast as AMD/NVIDIA GPUs, mainly due to limited bandwidth.

Btw, just yesterday, Alibaba announced the Qwen3.7 preview! If they do release compact open-weights models that are even a few % better than 35B and 27B, this would be an epic event, probably big enough to upset some big AI players.

As for frontier models, the future of open-weights models has never looked this good. MiniMax M2.7, Kimi K2.6, and DeepSeek V4 Flash/Pro are all very competent. I consider DeepSeek V4, Qwen3.6 and Gemma 4 to be part of a new generation of LLMs that are focused on coding/agentic tasks at lower runtime memory requirements. I think the DDR5 RAM shortage caused these innovations indirectly. Constraining runtime memory use is an excellent solution and an Uno reverse card against the semi-deliberate shortage.

The Next Destination

At this point, I think I know enough and have enough set up to start getting into running autonomous agents. Hermes is interesting to me, and I was thinking today about what I’d like it to do. A combo of project manager/build engineer/QA would be epic.

I also want to set up vision support in my Qwen3.6-35B-A3B, as I read it’s excellent at that. This would be great for testing/debugging visual aspects of an application or just games. I’d have to implement a way for the game to communicate what it’s doing via its logs so the agent knows when to take a screenshot, or perhaps create a SKILL that lets the agent tell the game when it wants a screenshot to be taken?
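The log-based idea could look something like the sketch below. Everything here is hypothetical: the `SCREENSHOT_HINT` marker, the log line format, and the function name are placeholders I made up to show the shape of it, not an existing convention in any agent or game framework.

```python
import re

# Hypothetical marker the game would write to its log, for example:
#   [frame 1234] SCREENSHOT_HINT tag=boss_intro
MARKER = re.compile(r"\[frame (\d+)\] SCREENSHOT_HINT tag=(\w+)")


def screenshot_requests(log_lines):
    """Yield (frame, tag) for every log line that carries the marker."""
    for line in log_lines:
        match = MARKER.search(line)
        if match:
            yield int(match.group(1)), match.group(2)


if __name__ == "__main__":
    demo_log = [
        "[frame 1200] player spawned at (16, 32)",
        "[frame 1234] SCREENSHOT_HINT tag=boss_intro",
        "[frame 1300] enemy defeated",
    ]
    for frame, tag in screenshot_requests(demo_log):
        # A real agent would capture the screen here and attach the image to its context.
        print(f"capture requested at frame {frame} for '{tag}'")
```

On the game side, pygame’s `pygame.image.save` could write the current frame to disk at the same moment the marker is logged, so the agent only has to pick up the file.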

There’s also video support, but I don’t want to worry about that now.