
Apparently it's possible to pool the memory of two 3090s using NVLink (although not with 4090s). This would make it possible to run large LLMs on consumer hardware.

https://huggingface.co/transformers/v4.9.2/performance.html

Before I invest in a new GPU, though, I would like to verify that this actually works, since the conventional wisdom used to be that SLI only doubled performance, not memory.

So has anyone tried it yet? What's the token rate?

user2741831

1 Answer


Memory pooling is not really much of a thing these days: the interface does not suddenly give you a single address space. You still have individual GPUs; you just enable peer transfers between them. This makes sense from a design point of view, because to be efficient the software has to know which data lives on which physical device, so that unnecessary transfers can be avoided.

Therefore I'd say the premise of your question is flawed. Perhaps it should be edited.
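
To make that concrete, here is a minimal sketch of what "enabling peer transfers" looks like with the plain CUDA runtime API. This is my own illustration, not llama.cpp's code, and it assumes the two 3090s show up as devices 0 and 1:

    // Check and enable peer-to-peer access between two GPUs.
    // Allocations still live on one specific device; peer access only
    // lets you copy across the link directly instead of via the host.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int can01 = 0, can10 = 0;
        cudaDeviceCanAccessPeer(&can01, 0, 1);   // can device 0 access device 1?
        cudaDeviceCanAccessPeer(&can10, 1, 0);   // and the other direction
        printf("0->1: %d, 1->0: %d\n", can01, can10);
        if (!can01 || !can10) return 1;          // no P2P path available

        // Peer access is opt-in and enabled per direction.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);

        // Each buffer is still allocated on a particular GPU...
        const size_t bytes = 64UL << 20;
        void *buf0 = nullptr, *buf1 = nullptr;
        cudaSetDevice(0); cudaMalloc(&buf0, bytes);
        cudaSetDevice(1); cudaMalloc(&buf1, bytes);

        // ...and moving data between them is an explicit copy, which with
        // peer access enabled goes directly over NVLink rather than being
        // staged through host RAM.
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
        cudaDeviceSynchronize();

        cudaFree(buf1);
        cudaSetDevice(0);
        cudaFree(buf0);
        return 0;
    }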

Anyway, I'm running llama.cpp on dual 3090s with NVLink enabled. llama.cpp does implement peer transfers, and they can significantly speed up inference: for example, from 10 tok/s to 17 tok/s for a 70B model.
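
If you want to verify that the link actually does something before buying a second card, a rough bandwidth check is enough. Again, this is my own sketch rather than llama.cpp code, and the device IDs, buffer size, and repetition count are arbitrary. Over the NVLink bridge the direct copy should report several tens of GB/s; if the copy ends up staged through host memory the number will be much lower:

    // Time repeated GPU0 -> GPU1 copies and report effective bandwidth.
    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 512UL << 20;  // 512 MiB test buffer per GPU
        const int reps = 20;

        // Enable peer access both ways (ignore the error if already enabled).
        cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
        cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

        void *src = nullptr, *dst = nullptr;
        cudaSetDevice(0); cudaMalloc(&src, bytes);
        cudaSetDevice(1); cudaMalloc(&dst, bytes);

        cudaMemcpyPeer(dst, 1, src, 0, bytes);   // warm-up copy
        cudaDeviceSynchronize();

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < reps; ++i)
            cudaMemcpyPeer(dst, 1, src, 0, bytes);
        cudaDeviceSynchronize();                 // wait for all copies to finish
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        printf("GPU0 -> GPU1: %.1f GB/s\n", bytes * (double)reps / secs / 1e9);

        cudaFree(dst);
        cudaSetDevice(0);
        cudaFree(src);
        return 0;
    }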

Steven Lu