I run models using HF just fine. I mean I'm using the HF transformers repo, which pulls models from the HF hub.
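
Roughly like this - a minimal sketch (the model id is a placeholder; gated repos also need a hub login, and device_map="auto" needs accelerate installed):

    # minimal sketch: transformers downloads the weights from the HF hub on first use
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-13b-hf"  # placeholder; any hub model id you have access to
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))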

Or do you mean commercial deployment of models for inference?




Are you talking about the Hugging Face Python libraries, the Hugging Face hosted inference APIs, the Hugging Face web interfaces, the Hugging Face iPhone app, Hugging Face Spaces (hosted Docker environments with GPU access) or something else?


I updated my comment above: I'm using the HF transformers repo, which pulls models from the HF hub.


Do you have an NVIDIA GPU? I have not had much luck with the transformers library on a Mac.


Of course. I thought Nvidia GPUs were pretty much a must-have for playing with DL models.


Well, being able to run these models on a CPU was pretty much the revolutionary part of llama.cpp.


I can run them on CPU - HF uses plain PyTorch code, which is fully supported on CPU.
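
E.g. something like this (sketch - the model id is a placeholder, any causal LM from the hub works the same way):

    # sketch: everything stays on CPU, plain PyTorch
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    torch.set_num_threads(8)  # tune to your core count

    model_id = "gpt2"  # placeholder
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).to("cpu")

    inputs = tokenizer("Hello, my name is", return_tensors="pt")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))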


But it's likely to be much slower than what you'd get with a backend like llama.cpp on CPU (particularly if you're running on a Mac, but I think on Linux as well), and it doesn't support features like CPU offloading.


Are there benchmarks? A 2x speedup would not be enough for me to return to C++ hell, but 5x might be, in some circumstances.


I think the biggest selling point of ollama (llama.cpp) is quantization: for a slight hit in quality (with q8 or q4) you get a significant performance boost.


Does ollama/llama.cpp provide low-bit operations (AVX or CUDA kernels) to speed up inference? Or just model compression, with inference still done in fp16?

My understanding is that modern quantization algorithms are typically implemented in PyTorch.
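
To illustrate the distinction I mean (a toy PyTorch sketch of weight-only quantization - not what llama.cpp actually does internally):

    import torch

    # toy weight-only quantization: store int8 weights plus a per-row scale,
    # dequantize back to float right before the matmul
    w = torch.randn(4096, 4096)                      # fp32 weights
    scale = w.abs().amax(dim=1, keepdim=True) / 127  # per-row scale
    w_q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)

    x = torch.randn(1, 4096)
    y = x @ (w_q.float() * scale).t()  # dequantize, then a normal float matmul

    # this saves memory (1 byte/weight instead of 2 or 4), but the
    # multiply-accumulate still happens in floating point; a true low-bit
    # kernel would compute directly on the integer data instead

So the question is whether the kernels do the second thing or only the first.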


Sorry, I don't know much about this topic.

The only thing I know (from using it) is that with quantization I can fit models like llama2 13b in my 24GB of VRAM when I use q8 (~16GB) instead of fp16 (~26GB). This means I get nearly the full quality of llama2 13b's output while still running entirely on my GPU, without having to fall back to very slow CPU+RAM inference.

And the models are quantized ahead of time, so I only download ~16GB for llama2 13b q8 instead of the full 26GB - it's not done on the fly.
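
The rough math for the weights alone looks like this (ignoring the KV cache and other runtime overhead; real quantized files are also a bit bigger than this because of the per-block scales):

    # back-of-the-envelope weight memory for a 13B model
    params = 13e9
    for name, bytes_per_param in [("fp16", 2.0), ("q8", 1.0), ("q4", 0.5)]:
        print(f"{name}: ~{params * bytes_per_param / 1e9:.1f} GB")
    # prints roughly: fp16 ~26 GB, q8 ~13 GB, q4 ~6.5 GB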


As an aside, even GPT-4-level quality does not feel satisfactory to me lately. I can't imagine willingly using models as dumb as llama2-13b. What do you do with it?


Yeah, I agree. Every time a new model is released I download the highest quantization (or fp16) that fits into my VRAM, test it out with a few prompts, and then realize that downloadable models are still not as good as the closed ones (except speed-wise).

I don't know why I still do it, but every time I read so many comments about how good model X is and how it outperforms everything else, I want to see it for myself.


There's a Python binding for llama.cpp which is actively maintained and has worked well for me: https://github.com/abetlen/llama-cpp-python
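
Basic usage is something like this (sketch - the model path is a placeholder, and you can pass n_gpu_layers if you want to offload layers to a GPU):

    from llama_cpp import Llama

    # path to a local GGUF file (placeholder)
    llm = Llama(model_path="./models/llama-2-13b.Q8_0.gguf")

    out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
    print(out["choices"][0]["text"])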


Ollama supports many Radeons now. And I guess llama.cpp does too - after all, it's what Ollama uses as its backend.


PyTorch (the underlying framework of HF) supports AMD as well, though I haven’t tried it.
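
My understanding is that the ROCm builds of PyTorch reuse the CUDA-style API, so checking for an AMD GPU would look like this (sketch - I haven't verified it on actual AMD hardware):

    import torch

    # on ROCm builds, AMD GPUs show up through the regular torch.cuda API
    print(torch.cuda.is_available())  # True if a supported AMD GPU is visible
    print(torch.version.hip)          # ROCm/HIP version string; None on CUDA-only builds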



