I don’t have a lot of knowledge on the topic, but I'm happy to point you in a good direction for reference material. I first heard about tensor layer offloading from a post here a few months ago. That post links to another one on MoE expert layer offloading, which it was based on. I highly recommend you read through both posts.
The gist of the tensor override strategy is: instead of offloading entire layers with --gpulayers, you use --overridetensors to keep specific large tensors (particularly FFN tensors) on the CPU while moving everything else to the GPU.
This works because (roughly) the attention side of every layer and the KV cache stay on the GPU, and only the bulky FFN weight matrices get computed on the CPU, which tends to be a better trade-off than dropping whole layers off the GPU.
You need to figure out exactly which tensors need to be offloaded for your model by looking at the weights and cooking up a regex, per the post.
Here's an example of the KoboldCpp startup flags for doing this. The key part is the --overridetensors flag and the regex contained in it:
python ~/koboldcpp/koboldcpp.py --threads 10 --usecublas --contextsize 40960 --flashattention --port 5000 --model ~/Downloads/MODELNAME.gguf --gpulayers 65 --quantkv 1 --overridetensors "\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up=CPU"
...
[18:44:54] CtxLimit:39294/40960, Amt:597/2048, Init:0.24s, Process:68.69s (563.34T/s), Generate:56.27s (10.61T/s), Total:124.96s
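If you want to peek at the tensor names yourself before cooking up a regex, here's a quick sketch using the `gguf` Python package. This is my own assumption of how you'd inspect things (not something from the post), and the GGUFReader API details may differ between package versions:

```python
# Sketch: list tensor names/shapes inside a GGUF so you can craft an --overridetensors regex.
# Assumes the `gguf` package (pip install gguf) and its GGUFReader API; treat as a starting point.
from gguf import GGUFReader

reader = GGUFReader("/path/to/MODELNAME.gguf")
for tensor in reader.tensors:
    # FFN weights show up with names like "blk.12.ffn_up.weight"
    if "ffn" in tensor.name:
        print(tensor.name, tensor.shape)
```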
The exact specifics of how you determine which tensors to override for each model, and the associated regex, are a little beyond my knowledge, but the people who wrote the tensor post did a good job explaining that process in detail. Hope this helps.
I would recommend you get a cheap wattage meter that plugs in between the wall outlet and the PSU powering your cards; they're $10-15 (the $30 name-brand Kill A Watts are overpriced and unneeded IMO). You can get rough approximations by adding up your cards' listed TDP specs, but that doesn't account for the motherboard, CPU, RAM, drives, and so on, or the real change between idle and load. With a meter you can just watch the total power draw with all that factored in, note the increase and the maximum as your rig inferences for a bit, and have the comfort of being reasonably confident in the actual numbers. Then you can plug the values into a calculation.
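Once you have real readings off the meter, the math is trivial; here's a little sketch where every number is just a made-up example to swap for your own:

```python
# Back-of-the-envelope energy cost from wall-meter readings.
# All numbers below are made-up examples; plug in what your meter actually shows.
idle_watts = 120.0       # whole-rig draw sitting idle
load_watts = 380.0       # whole-rig draw while inferencing
hours_per_day = 4.0      # time spent actually inferencing per day
price_per_kwh = 0.15     # your electricity rate in $/kWh

extra_kwh_per_day = (load_watts - idle_watts) / 1000.0 * hours_per_day
print(f"Extra energy from inferencing: {extra_kwh_per_day:.2f} kWh/day")
print(f"Extra cost: ~${extra_kwh_per_day * price_per_kwh * 30:.2f}/month")
```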
I have not tried any models larger than very low quant Qwen 32B. My personal limit for partial offloading speeds is 1 TPS, and the 32B models encroach on that. Once I get my VRAM upgraded from 8GB to 16-24GB, I'll test the waters with higher parameter counts and hit some new limits to benchmark :) I haven't tried out MoE models either, though I keep hearing about them. AFAIK they're popular because you can do advanced partial offloading strategies between the different experts to really bump up token generation, so playing around with them has been on my ML bucket list for a while.
Oh, I LOVE to talk, so I hope you don’t mind if I respond with my own wall of text :) It got really long, so I broke it up with headers.
TLDR: Bifurcation is needed because of how fitting multiple GPUs on one PCIe x16 lane works and consumer CPU PCIe lane management limits. Context offloading is still partial offloading, so you’ll still get hit with the same speed penalty—with the exception of one specific advanced partial offloading inference strategy involving MoE models.
To be clear about CUDA, it’s an API optimized for software to use NVIDIA cards. When you use an NVIDIA card with Kobold or another engine, you tell it to use CUDA as an API to optimally use the GPU for compute tasks. In Kobold’s case, you tell it to use cuBLAS for CUDA.
The PCIe bifurcation stuff is a separate issue when trying to run multiple GPUs on limited hardware. However, CUDA has an important place in multi-GPU setups. Using CUDA with multiple NVIDIA GPUs is the gold standard for homelabs because it’s the most supported for advanced PyTorch fine-tuning, post-training, and cutting-edge academic work.
But it’s not the only way to do things, especially if you just want inference on Kobold. Vulkan is a universal API that works on both NVIDIA and AMD cards, so you can actually combine them (like a 3060 and an AMD RX) to pool their VRAM. The trade-off is some speed compared to a full NVIDIA setup on CUDA/cuBLAS.
Bifurcation is necessary in my case mainly because of physical PCIe port limits on the board and consumer CPU lane handling limits. Most consumer desktops only have one x16 PCIe slot on the motherboard, which typically means only one GPU-type device can fit nicely. Most CPUs only have 24 PCIe lanes, which is just enough to manage one x16 slot GPU, a network card, and some M.2 storage.
There are motherboards with multiple physical x16 PCIe slots and multiple CPU sockets for special server-class CPUs like Threadrippers with huge PCIe lane counts. These can handle all those PCIe devices directly at max speeds, but they’re purpose-built server-class components that cost $1,000+ USD just for the motherboard. When you see people on homelab forums running dozens of used server-class GPUs, rest assured they have an expensive motherboard with 8+ PCIe x16 slots, two Threadripper CPUs, and lots of bifurcation. (See the bottom for parts examples.)
Information on this stuff and which motherboards support it is spotty—it’s incredibly niche hobbyist territory with just a couple of forum posts to reference. To sanity check, really dig into the exact board manufacturer’s spec PDF and look for mentions of PCIe features to be sure bifurcation is supported. Don’t just trust internet searches. My motherboard is an MSI B450M Bazooka (I’ll try to remember to get exact numbers later). It happened to have 4x4x4x4 compatibility—I didn’t know any of this going in and got so lucky!
For multiple GPUs (or other PCIe devices!) to work together on a modest consumer desktop motherboard + CPU sharing a single PCIe x16, you have to bifurcate that slot, splitting it into several smaller links (like 4x4x4x4) that both the motherboard and its BIOS explicitly support.
A secondary reason I’m bifurcating: the used server-class GPU I got for inferencing (Tesla P100 16GB) has no display output, and my old Ryzen CPU has no integrated graphics either. So my desktop refuses to boot with just the server card—I need at least one display-output GPU too. You won’t have this problem with the 3060. In my case, I was planning a multi-GPU setup eventually anyway, so going the extra mile to figure this out was an acceptable learning premium.
Bifurcation cuts into bandwidth, but it's actually not that bad. Going from x16 to x4 only results in about a 15% speed decrease, which isn't bad IMO. Did you say you're using an x1 riser though? That cuts it to a sixteenth of the bandwidth, so maybe I'm misunderstanding what you mean by x1.
I wouldn’t obsess over multi-GPU setups too hard. You don’t need to shoot for a data center at home right away, especially when you’re still getting a feel for this stuff. It’s a lot of planning, money, and time to get a custom homelab figured out right. Just going from Steam Deck inferencing to a single proper GPU will be night and day. I started with my decade-old ThinkPad inferencing Llama 3.1 8B at about 1 TPS, and it inspired me enough to dig out the old gaming PC sitting in the basement and squeeze every last megabyte of VRAM out of it. My 8GB 1070 Ti held me for over a year until I started doing enough professional-ish work to justify a proper multi-GPU upgrade.
Offloading context is still partial offloading, so you’ll hit the same speed issues. You want to use a model that leaves enough memory for context completely within your GPU VRAM. Let’s say you use a quantized 8B model that’s around 8GB on your 12GB card—that leaves 4GB for context, which I’d say is easily about 16k tokens. That’s what most lower-parameter local models can realistically handle anyway. You could partially offload into RAM, but it’s a bad idea—cutting speed to a tenth just to add context capability you don’t need. If you’re doing really long conversations, handling huge chunks of text, or want to use a higher-parameter model and don’t care about speed, it’s understandable. But once you get a taste of 15-30 TPS, going back to 1-3 TPS is… difficult.
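If you want to sanity-check that "4GB is easily ~16k tokens" sort of estimate for a specific model, the rough KV-cache math is simple enough to script. This is only a sketch with assumed Llama-3.1-8B-ish dimensions (layer count, KV heads, head size); the real numbers depend on the model config and whether you quantize the KV cache:

```python
# Rough KV-cache size check for "how much context fits in leftover VRAM".
# Dimensions below are assumed Llama-3.1-8B-ish values; check your model's actual config.
n_layers     = 32       # transformer blocks
n_kv_heads   = 8        # KV heads (GQA), not the full attention-head count
head_dim     = 128      # per-head dimension
bytes_per_el = 2        # fp16 cache; roughly 1 with 8-bit KV-cache quantization
context_len  = 16384

# K and V are each cached per layer, per KV head, per token
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_el * context_len
print(f"~{kv_bytes / 2**30:.2f} GiB of KV cache for {context_len} tokens")
# Comes out to roughly 2 GiB at fp16, so ~16k tokens sits comfortably inside ~4 GB of headroom
```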
Note that if you’re dead set on partial offloading, there’s a popular way to squeeze performance through Mixture of Experts (MoE) models. It’s all a little advanced and nerdy for my taste, but the gist is that you can use clever partial offloading strategies with your inferencing engine. You split up the different expert layers that make up the model between RAM and VRAM to improve performance—the unused experts live in RAM while the active expert layers live in VRAM. Or something like that.
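As a rough illustration only (I haven't run this myself): in llama.cpp/KoboldCpp-style GGUFs, the MoE expert FFN tensors usually have names ending in _exps, so people pin just those to CPU with the same override trick from above, along these lines:

python ~/koboldcpp/koboldcpp.py --model ~/Downloads/MOE_MODELNAME.gguf --usecublas --gpulayers 99 --overridetensors "ffn_.*_exps=CPU"

That keeps attention, norms, and the shared/router weights on the GPU while the big expert matrices sit in RAM. The exact tensor names vary by model, so inspect the GGUF before trusting any regex.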
I like to talk (in case you haven’t noticed). Feel free to keep the questions coming—I’m happy to help and maybe save you some headaches.
Oh, in case you want to fantasize about parts shopping for a multi-GPU server-class setup, here are some links I have saved for reference. GPUs used for ML can be fine on 8 PCI lanes (https://www.reddit.com/r/MachineLearning/comments/jp4igh/d_does_x8_lanes_instead_of_x16_lanes_worsen_rtx/)
A Threadripper Pro has 128 PCI lanes: (https://www.amazon.com/AMD-Ryzen-Threadripper-PRO-3975WX/dp/B08V5H7GPM)
You can get dual sWRX8 motherboards: (https://www.newegg.com/p/pl?N=100007625+601362102)
You can get a PCIe 4x expansion card on Amazon: (https://www.amazon.com/JMT-PCIe-Bifurcation-x4x4x4x4-Expansion-20-2mm/dp/B0C9WS3MBG)
All together, that’s 256 PCI lanes per machine, as many PCIe slots as you need. At that point, all you need to figure out is power delivery.
Thanks for being that guy, good to know. Those specific numbers were just done tonight with DeepHermes 8B q6km (finetuned from Llama 3.1 8B) with max context at 8192. In the past, before I reinstalled, I managed to squeeze ~10k context out of the 8B by booting without a desktop environment. I happen to know that DeepHermes 22B IQ3 (finetuned from Mistral Small) runs at about 3 TPS partially offloaded with 4-5k context.
DeepHermes 8B is the fast and efficient general model I use for general conversation, basic web search, RAG, data table formatting/basic markdown generation, and simple computations with DeepSeek R1 distill reasoning CoT turned on.
DeepHermes 22B is the local powerhouse model I use for more complex tasks requiring either more domain knowledge or reasoning ability, for example helping break down legacy code and boilerplating simple functions for game creation.
I have a vision model + TTS pipeline for OCR scanning and narration using Qwen 2.5 VL 7B + OuteTTS + WavTokenizer, which I was considering trying to benchmark too, though I'd need to add up both the LLM TPS and the audio TTS TPS.
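If I ever get around to measuring it, the plan is basically to time each stage separately and report both numbers, something like this sketch (run_llm and run_tts are hypothetical stand-ins for whatever the pipeline actually calls):

```python
# Sketch for timing a two-stage LLM -> TTS pipeline separately.
# run_llm() and run_tts() are hypothetical stand-ins for the real pipeline calls.
import time

def run_llm(prompt: str) -> str:
    return "example narration text"   # stand-in for the Qwen 2.5 VL call

def run_tts(text: str) -> bytes:
    return b""                        # stand-in for the OuteTTS + WavTokenizer call

def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

text, llm_s = timed(run_llm, "Read this scanned page aloud: ...")
audio, tts_s = timed(run_tts, text)

tokens = len(text.split())            # crude count; use the real tokenizer if available
print(f"LLM: {tokens / llm_s:.1f} tok/s over {llm_s:.2f}s")
print(f"TTS: {tts_s:.2f}s of synthesis")
print(f"End to end: {llm_s + tts_s:.2f}s")
```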
I plan to load up a stable diffusion model and see how image generation compares but the calculations will probably be slightly different.
I hear there are one or two local models floating around that work with Roo-Cline for advanced tool usage. If I can find a local model in the 14B range that works with Roo, even just for basic stuff, it will be incredible.
Hope that helps inform you; sorry if I missed something.
No worries :) A model fully loaded onto the 12GB of VRAM on a 3060 will give you a huge boost, around 15-30 TPS depending on the bandwidth throughput and tensor cores of the 3060. It's really such a big difference; once you get a properly fitting quantized model you're happy with, you probably won't be thinking of offloading to RAM again if you just want LLM inferencing. Check that your motherboard supports PCIe bifurcation before you make any multi-GPU plans. I got super lucky with my motherboard allowing 4x4x4x4 bifurcation for potentially four GPUs, but I could easily have been screwed if it didn't.
I just got a second PSU just for powering multiple cards on a single bifurcated PCIe slot for a homelab type thing. A snag I hit that you might be able to learn from: a PSU needs to be turned on by the motherboard before it can power a GPU. You need a ~$15 electrical relay board that passes the power-on signal from the motherboard to the second PSU, or it won't work.
It's going to be slow as molasses partially offloaded onto regular RAM no matter what; it's not like DDR4 vs DDR3 is that much different speed-wise, maybe a 10-15% increase if that. If you're partially offloading and not doing some weird optimized MoE type of offloading, expect 1-5 tokens per second (really more like 2-3).
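The reason RAM generation speed barely matters: token generation is mostly memory-bandwidth-bound, so a crude ceiling is just "bytes of weights read per token" divided by bandwidth. Here's a sketch where every number is a made-up example:

```python
# Crude upper bound on partially-offloaded generation speed (all numbers are made-up examples).
# Each generated token has to stream through (roughly) every weight once, so whichever
# memory pool is slowest and holds the most weights dominates.
ram_bytes      = 12e9    # ~12 GB of weights spilled to system RAM
ram_bandwidth  = 40e9    # ~40 GB/s, dual-channel DDR4-ish
vram_bytes     = 8e9     # weights that fit in VRAM
vram_bandwidth = 300e9   # a mid-range card

seconds_per_token = ram_bytes / ram_bandwidth + vram_bytes / vram_bandwidth
print(f"~{1 / seconds_per_token:.1f} tokens/s upper bound")
# The RAM term swamps the VRAM term, which is why the RAM generation barely changes things
# compared to simply not spilling weights to RAM in the first place.
```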
If you're doing real inferencing work and need speed, then VRAM is king; you want to fit it all within the GPU. How much VRAM does the 3060 you're looking at have?
What does an MCP server do?
The point in time when the first qubit-based supercomputers transitioned from theoretical abstraction to proven physical reality, thus opening up the can of worms of feasibly cracking classical cryptographic encryption like an egg within humanly acceptable time frames instead of longer-than-the-universe's-lifespan time frames… Thanks, superposition-probability-based parallel computation.
Thank you for deciding to engage with our community here! You’re in good company.
Kobold just released a bunch of tools for quant making that you may want to check out.
I have not made my own quants. I usually just grab whatever imatrix GGUF bartowski or the other top quant makers on HF release.
I too am in the process of upgrading my homelab and opening up my model engine as a semi-public service. The biggest performance gains I've found are from using CUDA and loading everything into VRAM. So far I've just been working with my old NVIDIA 1070 Ti 8GB card.
I haven't tried the vLLM engine, just Kobold. I hear good things about vLLM; it will be something to look into sometime. I'm happy and comfortable with my model engine system, as I've got everything set up just the way I want it, but I'm always open to performance optimization.
If you haven't already, try running vLLM with its CPU niceness set to the highest priority. If vLLM can use flash attention, try that too.
I'm just enough of a computer nerd to get the gist of technical things and set everything up on the software/networking side. I bought a domain name, set up a web server, and hardened it. Kobold's web UI didn't come with HTTPS SSL/TLS cert handling, so I needed to get a reverse proxy working to get the connection properly encrypted.
I am really passionate about this even though so much of the technical nitty-gritty under the hood of these models goes over my head. I was inspired enough to buy a Tesla P100 16GB and try shoving it into an old gaming desktop, which is my current homelab project. I don't have a lot of money, so this was months of saving for the used server-class GPU and the PSU to run it alongside the 1070 Ti 8GB I already have.
The PC/server-building hardware side scares me, but I'm working on it. I'm not used to swapping parts out at all; when I tried to build my own PC a decade ago, it didn't last long before something blew, so there's a bit of residual trauma there. I'm worried about things not fitting right in the case, or destroying something, or the card not working, and all that.
Those are unhealthy worries when I'm trying to apply myself to this cutting-edge stuff. I'm really trying to work past that anxiety and just try my best to install the stupid GPU. I figure if I fail, I fail; that's life, and it will be a learning experience either way.
I want to document the upgrade journey on my new self-hosted site. I also want to open my Kobold service to public use by fellow hobbyists. I'm not quite confident in sharing my domain on the public web just yet, though; I'm still cooking.
Have you by chance checked out the kobold.cpp Lite web UI? It allows some of what you're asking for, like RAG for worldbuilding, adding images for the LLM to describe and work into the story, easy editing of input and output, and lots of customization in settings. I have a public instance of the Kobold web UI set up on my website, and I'm cool with fellow hobbyists using my compute to experiment with things. If you're interested in trying it out to see if it's more what you're looking for, feel free to send me a PM and I'll send you the address and an API key/password.
In an ideal world, what exactly would you want an AI-integrated text editor to do? Depending on what needs to happen in your workflow, you can automate the copy-pasting and log outputs automatically with Python scripts and your engine's API.
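For what it's worth, here's the kind of thing I mean: a minimal sketch that assumes KoboldCpp's KoboldAI-compatible /api/v1/generate endpoint on its default port. Endpoint names, fields, and ports differ between engines and versions, so check your engine's API docs:

```python
# Minimal sketch: send a prompt to a local engine and append the result to a log file.
# Assumes a KoboldCpp-style /api/v1/generate endpoint; adjust URL/fields for your engine.
import json
import datetime
import requests

API_URL = "http://localhost:5001/api/v1/generate"   # default KoboldCpp port; change to match yours

def generate(prompt: str, max_length: int = 300) -> str:
    payload = {"prompt": prompt, "max_length": max_length, "temperature": 0.7}
    resp = requests.post(API_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["results"][0]["text"]   # KoboldAI-style response shape

if __name__ == "__main__":
    prompt = "Rewrite the following paragraph more concisely:\n..."
    output = generate(prompt)
    # Append every run to a log so nothing gets lost in copy-paste shuffling
    with open("generation_log.jsonl", "a") as f:
        f.write(json.dumps({
            "time": datetime.datetime.now().isoformat(),
            "prompt": prompt,
            "output": output,
        }) + "\n")
    print(output)
```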
Editing and auditing stories isn't that much different from auditing codebases; it all boils down to understanding and the correct use of language to convey abstraction. I bet tweaking some agentic personalities and goals in VSCode + Roo could get you somewhere.
Nice post, Hendrik, thanks for sharing your knowledge and helping people out :)
I once got kobold.cpp working with its collection of TTS models + WavTokenizer system. Here's the wiki page on it.
It may not be as natural as a commercial voice model, but it may be enough to whet your appetite in the event that other solutions feel overwhelmingly complicated.
Wow, this is some awesome information, Brucethemoose, thanks for sharing!
I hope you don't mind if I ask some things. Tool calling is one of those things I'm really curious about. Sorry if this is too much; please don't feel pressured, you don't need to answer everything or anything at all. Thanks for being here.
I feel like a lot of people, including myself, only vaguely understand tool calling, how it's supposed to work, and simple practice exercises for using it via scripts and APIs. What's a dead-simple Python script someone could cook up to make a tool call through the OpenAI-compatible API?
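(My vague mental model is something like this untested sketch using the official openai client pointed at a local OpenAI-compatible server; whether a given local engine actually honors the tools field, and how the model was trained to emit the calls, is exactly the part I'm fuzzy on.)

```python
# Untested sketch of one tool-call round trip against an OpenAI-compatible endpoint.
# The endpoint URL, model name, and get_weather tool are all hypothetical placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5001/v1", api_key="not-needed-locally")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools,
)

# If the model chose to call the tool, the structured call shows up here instead of plain text
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```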
In your own words what exactly is tool calling and how does an absolute beginner tap into it? Could you clarify what you mean by ‘tool calling being built into their tokenizers’?
Would you mind sharing some sources where we can learn more? I’m sure huggingface has courses but maybe you know some harder to find sources?
Is tabbyAPI an engine similar to Ollama, llama.cpp, etc.?
What are exl2, exl3, etc.?
Yes, it would have been awesome of them to release a bigger one for sure :( At the end of the day, they're still a business that needs a product to sell, and I don't want to be ungrateful and complain that they don't give us everything. I expect that some day all these companies will eventually clam up and stop releasing models to the public altogether, once the dust settles and the monopolies are established. I'm happy to be here in an era where we can look forward to open-license models released every few months.
Devstral was released recently, trained specifically with tool calling in mind. I haven't personally tried it out yet, but people say it works well with VSCode + Roo.
The thing is that even if there isn't much energy in plastic to be extracted, there's still enough energy in it to make a viable food source. Now, consider the humble koala and its primary food source, fucking eucalyptus leaves. Eucalyptus is such a dogshit food source that koalas had to spend evolutionary time and energy just to spec into it, to the point that they pretty much can't eat anything else. Combine that with the fact that eucalyptus leaves are so devoid of nutrients that the koala has to spend all day, every day snacking on them just to not die of malnutrition.
Why? Why would a species even bother with this flim-flam if eucalyptus sucks that bad as a food source? The answer is food scarcity. Because eucalyptus grows everywhere koalas live, and because nobody else bothers to tap into the food source, this sets up an ecological niche by pretty much guaranteeing that any animal that successfully finds a way to make it work will have unlimited amounts of food/energy, simply because there's so damn much of it and nothing else wants to (or can) touch it. Sure, koalas might have paid the price by sacrificing some brain wrinkles, but who needs higher intelligence when you have leaves to snack on and sex to make babies?
A similar thing happened with trees and mushrooms. In the deep evolutionary history of our planet, trees were once the apex form of life, with forests covering pretty much the whole planet. This is because nothing knew how to eat the woody stems for a good couple million years. Much of the coal we dig up today is actually the preserved remains of these un-decomposed trees from the Carboniferous period, which just lay there petrified, never rotting, until the earth's tectonic movements buried the tree corpses deep enough in the crust for the carbon to compress into hard rock. The great change of the era happened when our humble mycelium bois finally figured out how to eat wood, essentially becoming the new apex life for a time by taking advantage of an unlimited and untapped food source (trees).
I suppose my point is to not underestimate the willingness of life to find new food sources. Microorganisms don't need much excuse, just a slight amount of selective pressure and a couple million/billion/trillion generations of evolutionary trial and error, which for bacteria takes maybe a couple of years; I forget how quickly modern microorganism colonies turn over generations, but it's FAST. Add some science nerds who love to play God/intelligent evolution with CRISPR tech and gene tagging, and yeah, for sure we'll get plastic-eating microbes figured out. Then begins the Pandora's box of plastic rotting when we don't want it to.
I haven't heard of this one before now; it will be interesting to see how it actually performs. I didn't see what license the models will be released under; I hope it's a more permissive one like Apache. Their marketing should cook up a catchy name that's easy to remember. They seem to be a native Western-language company, so I also hope it doesn't slip in too many random Chinese characters like Qwen does sometimes.
I've never really gotten into MoE models; people say you can get great performance gains with a clever partial offloading strategy between the various experts. Maybe one of these days!
It's just base-level pet tribalism for the sake of a cheap comic perpetuating a stereotype that's been beaten to death: "my choice in non-human species as extended family is better than YOURS!" All conscious entities have a unique combination of emotional understanding and relationship building over time, and most higher-thinking social animals understand the concept of affection and bonding.
The difference is that cats didn't have the same evolutionary pressures essentially forcing an affection and bonding drive on a species-wide level. Some cats are cold and distant and at most will allow a quick pet; some are warm and cuddly and give hugs. That's personality for you.
Meanwhile, almost all dogs are compulsively clingy and protective; it's deeply ingrained into pretty much every breed's unconscious instincts. Some are only friendly with known family and won't let strangers bond, but if you are family in their eyes, you're pretty much set.
The only real advantage dogs have over cats is that, with the right breed, they are good active tracking/hunting partners. Also, the bigger ones are lethal to their last breath, which makes neurotically scared and anxious people feel safe.
You know what dog owners never want to talk about, though? The fucking poop. Dogs poop everywhere in the backyard, and just about every dog owner I've ever known is too lazy to pick it up, making their backyard pretty much undesirable for social activities or for letting children run around playing. Also, depending on the breed, there's the separation anxiety that makes them go berserk and rip up trashcans and furniture.