Local Qwen - $0 Local-LLM Nightly Pipeline

The problem and the motivation

Working with large language models every day, I noticed a whole category of work that is useful but too expensive to do by hand and too repetitive to be worth paid tokens: checking whether a project's rules still make sense, trimming over-long memory notes, mining the durable lessons out of a day's work transcripts, scanning sources for ideas, hunting for dead code or references that no longer resolve. Each task on its own is small. Together they would cost hours and a real API budget if I ran them on paid cloud models.

The fix was to move all of this auxiliary work onto a local model that runs for free overnight, when neither I nor the GPU is busy. The hardware is a single AMD Ryzen AI Max+ 395 system with 128 GB of unified memory; the model occupies roughly 20 GB on the GPU. Because inference is local, the marginal cost of a night of processing is effectively zero. That changes the calculus entirely: when an operation costs nothing, it is worth running every single night even if it pays off only one time in ten.

The architecture: how it works

The heart of the system is a cron-style orchestrator that walks a 42-job DAG, defined as an ordered list (_JOB_ORDER) plus a function registry (_JOB_MAP). Order matters: a run opens with a self-check job that probes the infrastructure and inference lanes, then moves through memory and rules hygiene, lessons mining, source scanning, deep-read, REPA dossiers on articles, idea generation, code and spec audits, optimization syntheses, and finally the morning digest, verification, and report consolidation.

Startup has two paths. The first is a Windows Task Scheduler task that fires at 03:00 whether or not my working session is open; the runner waits for the system to be idle before it begins. The second is a prompt-submit hook: when I type "going to sleep", "off to bed", or "goodnight", the hook launches the runner on demand. The model is never auto-launched by the runner; LM Studio keeps it resident at 127.0.0.1:1234, exposing both the OpenAI-compatible API and an Anthropic-compatible endpoint.
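The phrase check in the prompt-submit hook could look roughly like this. The three phrases are the ones quoted above; the matching rules (case-insensitive, whole-phrase) and the names SLEEP_PHRASES and should_start_night_run are assumptions for illustration, not the hook's actual code.

```python
import re

# Trigger phrases from the text; case-insensitive whole-phrase matching is assumed.
SLEEP_PHRASES = ("going to sleep", "off to bed", "goodnight")

_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in SLEEP_PHRASES) + r")\b",
    re.IGNORECASE,
)


def should_start_night_run(prompt: str) -> bool:
    """Return True when the submitted prompt contains one of the sleep phrases."""
    return _PATTERN.search(prompt) is not None
```

Word boundaries keep an unrelated prompt such as "fix the sleep timer" from starting a run.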

Each job writes its result as a dated Markdown file under REPORTS/qwen_nightly/. A separate generator compiles all of those reports into a single tabbed HTML dashboard, served at a fixed URL and auto-refreshing every 300 seconds. The dashboard has one tab per job, each with freshness badges (green for today, orange with a day count for older content), so a stale report is never presented as current.
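The badge rule can be sketched as a small pure function. The colors follow the description above; the label wording and the function name freshness_badge are assumed.

```python
from datetime import date
from typing import Tuple


def freshness_badge(report_date: date, today: date) -> Tuple[str, str]:
    """Map a report's date to a (color, label) pair for its dashboard tab."""
    age_days = (today - report_date).days
    if age_days <= 0:
        return ("green", "today")
    # Older content is flagged with its age so it is never mistaken for current.
    return ("orange", f"{age_days}d old")
```

Passing today in as an argument, rather than calling date.today() inside, keeps the function deterministic and easy to test.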

The model and the key technical decisions

The chosen model is Qwen3.6-35B-A3B, a mixture-of-experts architecture, at IQ4_XS quantization. The choice came out of a homegrown benchmark harness that compared several quantization variants on speed, memory footprint, and reasoning quality; IQ4_XS won with a good balance of all three and a context window large enough to digest entire session transcripts. Because the hardware has a single GPU, a firm rule forbids loading a second language model in parallel; the only exception is the bge-m3 embedding model, which must stay resident because it feeds a separate vector daemon.

An important design decision was to make local Qwen strictly a night job. During the day, auxiliary task classes (summaries, deduplication, field extraction, first drafts) are routed to free cloud lanes such as NVIDIA NIM, cloud-Qwen, or OpenRouter, so they do not tie up the GPU when I need it. The local model only steps in via the night job, which uses it exactly when nothing else is running.

Notable engineering details

The most instructive technical problem was a subtle truncation. Early on, the memory-rewrite job produced only 4 of 20 correct results; the rest were 7-character stubs. The cause, found through a short investigation, was that the inference server did not split the <think> reasoning block into a dedicated field but stuffed it into the response content. Under the token limit, the reasoning at the start survived while the actual answer, sitting at the end, got cut. The fix had two parts. First, the client strips <think> blocks from content before extracting the reply and treats an unclosed opening <think> tag as an empty answer (an honest failure, not garbage). Second, the system prompts of jobs that need short output include an explicit instruction never to emit think tags at all. The /no_think directive did not work on this model-and-server combination; the explicit instruction was the only reliable suppression.
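The client-side half of that fix might look like the following sketch. It assumes the reasoning is wrapped in literal <think> and </think> tags inside the content string; the function name extract_answer is hypothetical.

```python
import re

# Matches a complete reasoning block; DOTALL lets it span multiple lines.
_CLOSED_THINK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def extract_answer(content: str) -> str:
    """Strip reasoning blocks; return "" if the reply was cut off mid-reasoning."""
    without_closed = _CLOSED_THINK.sub("", content)
    if "<think>" in without_closed.lower():
        # An opening tag with no close means the token limit hit before the
        # answer started, so report an empty answer rather than partial reasoning.
        return ""
    return without_closed.strip()
```

Returning an empty string makes the truncation visible to the caller, which can then count the job as failed instead of storing a stub.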

Anti-fragility is built in at several layers. A kill-on-resume watchdog monitors my session activity: if I start working again overnight, the watchdog stops processing and frees the GPU within roughly 30 seconds, and it also enforces a maximum time budget per run. A cooperative GPU lock serializes access to the single card, honoring the no-two-models rule. A URL deduplication cache, persistent across nights, avoids reprocessing the same sources. The execution philosophy is that an error in one job must not stop the run: later jobs continue, and a barrier-style consumer waits only for the producers that actually ran.
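One common way to build a cooperative lock is an exclusively created lock file, sketched below. This is an assumption about the mechanism, not the pipeline's actual implementation; the name gpu_lock and the lock path are hypothetical, and stale-lock recovery after a crash is deliberately left out.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def gpu_lock(lock_path: str):
    """Yield True if this caller acquired the lock, False if someone else holds it."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False
        return
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))  # record the holder for debugging
        yield True
    finally:
        os.remove(lock_path)  # release the lock even if the job raised


# Demonstration: a second acquisition attempt fails while the first is held.
lock_file = os.path.join(tempfile.mkdtemp(), "gpu.lock")
with gpu_lock(lock_file) as outer:
    with gpu_lock(lock_file) as inner:
        nested = (outer, inner)
released = not os.path.exists(lock_file)
```

The lock is cooperative in the sense that it only works if every GPU-using job checks the returned flag before loading a model.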

To prove the system genuinely works, I wrote a separate verifier with three live signals: whether the model is resident, whether the last run finished with a complete status (not a launch failure), and how fresh the most recent output is. The verdict is WORKING only if the last run was complete and the output is recent. This verifier exists precisely because, at one point, the model launch was failing every iteration due to a bad eviction parameter while the shallow signals looked fine. Alongside it, a nightly verifier job uses a free cloud lane to review the reports before the digest, a telemetry job tracks duration and error trends across nights, and a dedicated job checks backup freshness and the presence of critical files.
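The verdict rule can be written down in a few lines. In this sketch the three fields mirror the three signals; the status strings, the 36-hour freshness threshold, and the NOT_WORKING label are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    model_resident: bool      # reported alongside the verdict
    last_run_status: str      # "complete", "launch_failed", ... (labels assumed)
    output_age_hours: float   # age of the most recent report


def verdict(signals: Signals, max_age_hours: float = 36.0) -> str:
    """WORKING only if the last run completed and its output is recent."""
    if signals.last_run_status == "complete" and signals.output_age_hours <= max_age_hours:
        return "WORKING"
    return "NOT_WORKING"
```

A resident model with a failed launch status still yields NOT_WORKING here, which is the failure mode the shallow signals missed.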

Cost and optimizations

Local inference is free; the only possible costs are optional auxiliary calls, such as a paid pre-scan that fires only when an explicit approval file is present. With those calls enabled, a night costs between 0.05 and 0.15 dollars; without them, it costs nothing. Beyond money, the system also optimizes its own work: an optional DAG scheduler can parallelize jobs that do not need the GPU, after each job has been classified as GPU-bound or GPU-free. Several jobs are dedicated to continuous improvement itself: a deterministic optimization audit, an LLM synthesis over its output, a multi-agent optimization swarm running on free agents, and a rotating rules audit that always gives the local model fresh work.
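A simplified sketch of the GPU-bound versus GPU-free split: GPU-free jobs go to a thread pool while GPU-bound jobs run one at a time. It ignores dependencies between jobs, which a real DAG scheduler would have to respect; the Job tuple, run_partitioned, and the classification of the demo jobs are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Job = Tuple[str, bool, Callable[[], str]]  # (name, needs_gpu, function)


def run_partitioned(jobs: List[Job], max_workers: int = 4) -> Dict[str, str]:
    """Run GPU-free jobs concurrently and GPU-bound jobs strictly in sequence."""
    results: Dict[str, str] = {}
    gpu_free = [job for job in jobs if not job[1]]
    gpu_bound = [job for job in jobs if job[1]]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # GPU-free work starts immediately in the background.
        futures = {name: pool.submit(fn) for name, _, fn in gpu_free}
        # GPU-bound work stays serialized on the calling thread.
        for name, _, fn in gpu_bound:
            results[name] = fn()
        for name, future in futures.items():
            results[name] = future.result()
    return results


# Demo classification is assumed, not taken from the pipeline.
demo_jobs: List[Job] = [
    ("backup_check", False, lambda: "cpu"),
    ("lessons_mining", True, lambda: "gpu"),
    ("telemetry", False, lambda: "cpu"),
]
demo_results = run_partitioned(demo_jobs)
```

Serializing the GPU-bound jobs on a single thread preserves the one-model-at-a-time rule without any extra locking inside this function.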

The outcome and current status

Since the first real run, the system has generated more than 4,200 advisory reports, with dated files spanning from late May to mid-June 2026. The codebase matured from an initial set of 5 jobs to a 42-job DAG, supported by dozens of specialized scripts. The foundational principle has held from the start: everything is strictly advisory. No job overwrites the memory, rules, or code files without human approval; results land in a dashboard, and I decide what is worth applying. Local Qwen is, in essence, a night-shift colleague who works for free, never tires, and never presses the publish button on its own.