Projects done with the help of Average Intelligence tools have "Sloppy" in their name. Others never touched.
Of course "AI" is a great to get stuff done fast, but it's quite dumb and have to be very carefully guided.
Some people compare it to a parrot facerolling on the keyboard.
LLM is not even a neural network, it's an autocomplete dictionary for T9 text predictions just like in old phones.
Repeatedly tap on your phone's text predictions - this is the current state of "AI".
Now with proper expectations you're ready to start building.
Oh, BTW. If you don't want to feed money to cloud services - start with your own local LMStudio/ComfyUI machine.
All you need is 16GB VRAM GPU and 32GB RAM to start, CPU doesn't matter. It's really that cheap.
Setup takes 3 weeks of pure suffering and you're ready for a true AI future, it'll pay off in less than a year.
Our videocards now can not only run games, but write somewhat useful code. That's pretty cool right?
And if part of your job or pipeline can actually be replaced by a parrot, maybe it should be replaced.
Think of writing and updating tests. If you're blank-staring at the wall right now, you get it.
Don't let LLMs think for you or build an architecture - it's all harmful random garbage.
24GB GPU VRAM + 64GB RAM (comfortable):
Gemma 4 48k - unsloth/gemma-4-31b-it@iq4_xs (temp 0.3, top k 64)
Xortron 56k - xortron.criminalcomputing.2026.27b.next@q5_k_m (temp 0.1, top k 20)
16GB GPU VRAM + 32GB RAM (starter):
Local Xortron 32k - xortron.criminalcomputing.2026.27b.next@iq3_xs (2 layers on CPU, Q8_0 KV Cache, temp 0.3, top k 20)
Global settings: Repetition Penalty 1.1, Min P Sampling 0.05, Top P Sampling 0.95.
This is how these settings work (yeah I know, pretty much every IT video).
If you can get anything done on 16GB GPU VRAM model, you should invest in RTX 3090 or a multi-GPU setup.
Every 8GB extra VRAM is an astronomic leap in quality. 16GB models are not even close to 24GB models.
Use OpenAI-compatible API to connect to LM Studio. The https://zed.dev/ seems to be best open-source agentic IDE.
Here are jinja templates for LM Studio and Zed. Very tedious to get right.
Put Responses MUST be terse and short. in a rule or system prompt, or use my portable caveman prompt.
Vision consumes a lot. Use Q8_0 or BF16 .mmproj files so you don't have to blind the model completely.
I use very low temperatures to prevent tool use typos/screwups, since I use LLMs mostly for refactoring.
Try not to use Q8_0 KV Cache, no matter how tempting or what the benchmarks say - it wrecks reasoning and tool calls.
To avoid Gemma 4 thinking bugs, use "<|channel>" as your reasoning start string, not "<|channel>thought".
All models should use 8k output token limit to prevent occasional very long useless loops when it fails a tool call.
Always disable Unified KV Cache and set Max Concurrent Prediction to 1, unless model is intended to work in parallel.
- ComfyUI-Enhancement-Utils - PC resource monitor and execution follower
- ComfyUI-SloppyAudio - Audio editing tools based on SoX and BS-RoFormer
- smol-caveman - Portable Caveman prompt designed for local LLMs. Read less slop and get much better results.
- ComfyUI-SloppyInstall.bat - Simplified pip install -r "requirements.txt" for custom nodes in portable ComfyUI.
- SloppyServer.bat - Single file local/Wi-Fi server for debugging multithreaded mobile Unity WebGL builds and other apps

