this looks very awesome. can someone tell me why there is no chatter about this?...

brrrrrm · 2024-04-28T23:28:35

HN isn't really the best space for LLM news - r/LocalLlama and Twitter are much better. I think HN has some cultural issues with “AI” news

wkat4242 · 2024-04-28T23:53:49

Hmm I don't think so. Most comments are pretty positive.

I think the articles are just not really upvoted unless it's really big news, makes sense because HN is for more than just AI.

But I don't think it's anti-AI in the way most people here are pretty anti-cryptocurrency (and for good reason IMO)

p1esk · 2024-04-29T00:51:25

I didn’t upvote it because I don’t use Ollama. To experiment with LLMs I use Huggingface. Does Ollama provide something I cannot get with Huggingface?

lolinder · 2024-04-29T04:05:47

Ollama provides a web server with API that just works out of the box, which is great when you want to integrate multiple applications (potentially distributed on smaller edge devices) with LLMs that run on a single beefy machine.

In my home I have a large gaming rig that sometimes runs Ollama+Open WebUI, then I also have a bunch of other services running on a smaller server and a Raspberry Pi which reach out to Ollama for their LLM inference needs.
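
A minimal sketch of what such a client can look like from one of the smaller machines, using only the Python standard library. It assumes Ollama's default port (11434) and its `/api/generate` endpoint; the model name `llama3` and the helper names are placeholders for illustration, not something from the setup described above.

```python
import json
import urllib.request

# Assumption: Ollama's default listen address. Replace "localhost" with the
# host name of the machine that actually runs the server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama3", url=OLLAMA_URL):
    """Build a non-streaming generate request for an Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3", url=OLLAMA_URL, timeout=120):
    """Send the prompt and return the generated text from the JSON reply."""
    request = build_request(prompt, model, url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

Calling `generate("Why is the sky blue?")` only works when a server is reachable at that address and the named model has already been pulled on it.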

p1esk · 2024-04-29T04:37:23

Sure, maybe it’s better for niche use cases like yours.

HF is the biggest provider of LLMs, and I guess I haven’t run into its limitations yet.

jkh1 · 2024-04-29T05:47:01

Running locally is sometimes necessary, e.g. you don't want to send sensitive data to any random third party server.

Zambyte · 2024-04-29T13:04:53

Both Ollama and Huggingface distribute models. The latter also has model hosting services, but that isn't the only way to use models from there.

gertop · 2024-04-29T01:06:46

Hugging Face is a model repository.

Ollama allows you to run those models.

Different things.

p1esk · 2024-04-29T01:13:16

I run models using HF just fine. I mean I’m using HF transformers repo, which gets models from HF hub.

Or do you mean commercial deployment of models for inference?

simonw · 2024-04-29T01:19:20

Are you talking about the Hugging Face Python libraries, the Hugging Face hosted inference APIs, the Hugging Face web interfaces, the Hugging Face iPhone app, Hugging Face Spaces (hosted Docker environments with GPU access) or something else?

p1esk · 2024-04-29T01:20:38

I updated my comment above: I’m using HF transformers repo, which gets models from HF hub.

simonw · 2024-04-29T01:23:50

Do you have an NVIDIA GPU? I have not had much luck with the transformers library on a Mac.

p1esk · 2024-04-29T01:33:24

Of course. I thought Nvidia GPUs are pretty much a must have to play with DL models.

objektif · 2024-04-29T01:41:21

Well being able to run these models on CPU was pretty much the revolutionary part of llama.cpp.

p1esk · 2024-04-29T01:48:50

I can run them on CPU - HF uses plain Pytorch code - fully supported on CPU.

tmostak · 2024-04-29T03:34:35

But it's likely to be much slower than what you'd get with a backend like llama.cpp on CPU (particularly if you're running on a Mac, but I think on Linux as well), as well as not supporting features like CPU offloading.

p1esk · 2024-04-29T04:04:14

Are there benchmarks? 2x speed up would not be enough for me to return to c++ hell, but 5x might be, in some circumstances.

SushiHippie · 2024-04-29T08:00:48

I think the biggest selling point of ollama (llama.cpp) is quantization: for a slight hit in quality (with q8 or q4) you can get a significant performance boost.
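
To make the "slight hit in quality" concrete, here is a toy sketch of absmax quantization: a block of weights is stored as small integers plus one shared scale. This is a generic illustration and not llama.cpp's actual q8/q4 formats, which are more elaborate; the function names and sample weights are made up.

```python
def quantize_block(values, bits=8):
    """Map floats to signed integers that share one scale (absmax scheme)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    # Fall back to a scale of 1.0 when every value in the block is zero.
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale


def dequantize_block(ints, scale):
    """Recover approximate floats from the stored integers and scale."""
    return [q * scale for q in ints]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.5, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
for bits in (8, 4):
    ints, scale = quantize_block(weights, bits)
    restored = dequantize_block(ints, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: worst-case rounding error {worst:.4f}")
```

The rounding error per weight is bounded by half the scale, so dropping from 8 to 4 bits widens the error while shrinking storage.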

p1esk · 2024-04-29T15:57:11

Does ollama/llama.cpp provide low bit operations (avx or cuda kernels) to speed up inference? Or just model compression with inference still done in fp16?

My understanding is the modern quantization algorithms are typically implemented in Pytorch.

SushiHippie · 2024-04-29T23:28:00

Sorry I don't know much about this topic.

The only thing I know (from using it) is that with quantization I can fit models like llama2 13b in my 24GB of VRAM when I use q8 (16GB) instead of fp16 (26GB). This means I can get nearly the full quality of llama2 13b's output while still being able to use only my GPU, without the need to do very slow inference on only CPU+RAM.

And the models are quantized before inference, so I'd only download 16GB for the llama2 13b q8 instead of the full 26GB, which means it's not done on the fly.
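
A back-of-the-envelope check of those sizes, as a sketch: it counts the weights only (parameters times bits per weight) and uses decimal gigabytes. It ignores file metadata, quantization scales, and the memory needed for context at inference time, so real figures such as the 16GB quoted above come out higher than the weights-only estimate. The helper names are made up for illustration.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_vram(n_params, bits_per_weight, vram_gb):
    """Weights-only check; leaves no headroom for context or activations."""
    return weight_size_gb(n_params, bits_per_weight) <= vram_gb


n_params = 13e9  # llama2 13b
print(weight_size_gb(n_params, 16))    # fp16 weights
print(weight_size_gb(n_params, 8))     # 8 bits per weight
print(fits_in_vram(n_params, 16, 24))  # fp16 on a 24GB card
print(fits_in_vram(n_params, 8, 24))   # 8-bit on a 24GB card
```

The fp16 estimate lands on the 26GB mentioned above, which is why that variant cannot sit entirely on a 24GB GPU while the 8-bit one can.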

p1esk · 2024-04-30T04:21:39

As an aside, even gpt4 level quality does not feel satisfactory to me lately. I can’t imagine willingly using models as dumb as llama2-13b. What do you do with it?

SushiHippie · 2024-04-30T12:34:32

Yeah I agree, every time a new model releases I download the highest quantization (or fp16) that fits into my VRAM, test it out with a few prompts, and then realize that downloadable models are still not as good as the closed ones (except speed wise).

I don't know why I still do it, but every time I read so many comments about how good model X is and how it outperforms anything else, I want to see it for myself.

simonw · 2024-04-29T16:02:34

There's a Python binding for llama.cpp which is actively maintained and has worked well for me: https://github.com/abetlen/llama-cpp-python

wkat4242 · 2024-04-29T01:57:39

Ollama supports many radeons now. And I guess llama.cpp does too, after all it's what ollama uses as backend.

p1esk · 2024-04-29T02:04:15

PyTorch (the underlying framework of HF) supports AMD as well, though I haven’t tried it.

chadsix · 2024-04-28T23:15:33

Ollama is really organized - it relies on llama but the UX and organization it provides make it legit. We recently made a one-click wizard to run Open WebUI and Ollama together, hosted locally but remotely accessible [1]

[1] https://github.com/ipv6rslimited/cloudseeder

gertop · 2024-04-29T01:03:47

LM Studio is a lot more user friendly, probably the easiest UI to use out there. No terminal nonsense, no manual to read. Just double click and chat. It even explains to you what the model names mean (eg diff between Q4_1 Q4_K Q4_K_M... For whatever reason all the other tools assume you know what it means).

Built-in model recommendations are also handy.

Very friendly tool!

However it's not open-source.

Cheer2171 · 2024-04-29T01:55:59

Why do you think there is no chatter about this? There have been hundreds of posts about ollama on HN. This is a point release of an already well known project.

FieryTransition · 2024-04-28T23:15:45

I use a mix of llamacpp directly via my own python bindings and via llamacpp-python for function calling and full control over parameters and loading, but otherwise ollama is just great for ease of use. There's really no reason not to use it if you just want to load gguf models and don't have any intricate requirements.

CharlesW · 2024-04-29T00:07:18

I can recommend LM Studio and Msty if you're looking for something with an integrated UX.

perrygeo · 2024-04-30T12:33:16

Opposite reaction here. I was just thinking, man I hear about Ollama every single day on HN. Not sure a point release is news :-)

throw03172019 · 2024-04-28T23:01:14

Ollama has been brought up many times on HN. It’s a great tool!