@jessegenet I highly recommend turning an abliterated 30b model into your personal psychiatrist. Started over a week ago and it’s amazing. I need to curate the profile md a bit but wow can these things be a mirror.
Running a local Qwen setup on this PC using Intel OpenVINO GenAI + Intel Arc graphics.
What’s been really shocking is that most of the time while the GPU is cranking out tokens I can actually use the computer at the same time and it barely slows down. I bought this box a year and a half ago having no clue that I would be able to use it for local inference. I’ve also tried GPT – OSS – 20 B and while it’s slow, it generates extremely high value tokens, very impressed. It’s a version that is also set up for the openVINO.
Hardware:
• PC: GEEKOM GT2 Mega
• CPU: Intel Core Ultra 9 285H
• Cores/threads: 16 cores / 16 logical processors
• RAM: 32 GB installed
• GPU: Intel Arc 140T GPU
• GPU memory reported by driver name: 16 GB
• GPU driver: Intel 32.0.101.8509
• NPU: Intel AI Boost
• System type: x64-based PC
Software/runtime:
• Host OS: Windows with WSL2
• WSL distro: Ubuntu 22.04.5 LTS
• WSL kernel: 6.6.87.2-microsoft-standard-WSL2
• OpenVINO lane is Windows-side, separate from the IPEX Ollama runtime
• Python: 3.13 virtual environment
• Packages: openvino-genai, huggingface_hub
• Model server: custom OpenAI-compatible HTTP server
• Server bind: 0.0.0.0:8099
• Inference mode: single-threaded queue, one generate() at a time
• Active model: qwen3-8b-int4-ov
• Model path: C:\Users\\models\openvino\Qwen3-8B-int4-ov
• Device target: GPU
• Open WebUI connects to it through WSL/Docker using the Windows host gateway
• Open WebUI sees the model as: qwen3-8b-int4-ov
• Startup is automated from Windows Startup with a hidden PowerShell launcher
Verified live:
• OpenVINO GenAI server listening on port 8099
• /v1/models returns qwen3-8b-int4-ov
• Open WebUI can see and use the model
• Server logs show successful chat completions on GPU
Recent observed responses from the server:
• 9 estimated completion tokens in 1.969s
• 48 estimated completion tokens in 4.286s
• 98 estimated completion tokens in 7.669s
This setup is running Qwen3 8B INT4 locally on Intel Arc 140T using OpenVINO GenAI, exposed through an OpenAI-compatible API, and connected into Open WebUI.
Running Gemma-4-26B-A4B-NVFP4 on my DGX Spark GB10 via vLLM.
Results: ~30 tok/s single-stream, 53 tok/s at c=2.
For context: Nemotron-3-Super 120B on the same hardware does 14 tok/s.
The MoE architecture is the reason — only 3.8B of 26B parameters activate per token step. Memory bandwidth stops mattering when you’re barely touching the weights.
18 GB weights. 2-min cold start. Full 128k context, parallel tool calls. Native image, audio, and video support. Should I try FP8 + MTP?
@0xSero I had a shocking Orwellian experience yesterday where Claude got righteous with me. It scared the crap out of me. No doubt it would have contacted my boss if it could’ve. I’m unplugging from Anthropic. Open source will save humanity. No doubt in my mind.
@0xSero I’m assuming you mean you can remote access a codex or claude session running on a local model. It’s really cool. I have had issues copying text on mobile though.