Hacker News
Show HN: Data Bonsai: a Python package to clean your data with LLMs (github.com/databonsai)
47 points by alvin_r_h 17 days ago | 6 comments
I've been using LLMs to clean data for my fine-tuning projects, and decided to build a package for it as a side project. Check it out here: https://github.com/databonsai/databonsai

Some features:

- categorization (labelling), transformation and decomposition (text into structured format)

- validates LLM outputs

- batch mode groups several inputs and outputs into one request, so the prompt (schema, few-shot examples) isn't sent for every row of data, which saves a significant number of tokens
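
To illustrate the batching and validation ideas in the list above, here is a minimal sketch. It is not databonsai's actual implementation or API: the function names, the `||` delimiter, the prompt wording, and the stub `fake_llm` are all hypothetical and chosen only for illustration. The schema and few-shot examples appear once per batch instead of once per row, and the reply is rejected if the label count or any label does not match the schema.

```python
from typing import Callable, Dict, List, Tuple

SEP = "||"  # hypothetical delimiter between per-row answers


def build_batch_prompt(
    categories: Dict[str, str],
    examples: List[Tuple[str, str]],
    rows: List[str],
) -> str:
    """Build one prompt that states the schema and few-shot examples once,
    followed by every row in the batch."""
    schema = "\n".join(f"- {name}: {desc}" for name, desc in categories.items())
    shots = "\n".join(f"Input: {text}\nLabel: {label}" for text, label in examples)
    numbered = "\n".join(f"{i + 1}. {row}" for i, row in enumerate(rows))
    return (
        f"Classify each input into one of these categories:\n{schema}\n\n"
        f"Examples:\n{shots}\n\n"
        f"Inputs:\n{numbered}\n\n"
        f"Reply with exactly {len(rows)} labels separated by '{SEP}'."
    )


def parse_and_validate(
    reply: str, categories: Dict[str, str], expected: int
) -> List[str]:
    """Split the reply into labels and reject anything outside the schema."""
    labels = [part.strip() for part in reply.split(SEP)]
    if len(labels) != expected:
        raise ValueError(f"expected {expected} labels, got {len(labels)}")
    unknown = [label for label in labels if label not in categories]
    if unknown:
        raise ValueError(f"labels outside the schema: {unknown}")
    return labels


def categorize_batch(
    rows: List[str],
    categories: Dict[str, str],
    examples: List[Tuple[str, str]],
    llm: Callable[[str], str],
) -> List[str]:
    """Send one prompt for the whole batch and return validated labels."""
    prompt = build_batch_prompt(categories, examples, rows)
    return parse_and_validate(llm(prompt), categories, len(rows))


# Stand-in for a real model call (assumption: any function str -> str works).
def fake_llm(prompt: str) -> str:
    return "Weather||Sports||Weather"


CATEGORIES = {"Weather": "Forecasts and conditions", "Sports": "Games and athletes"}
EXAMPLES = [("Rain expected tomorrow", "Weather")]
ROWS = ["Storm warning issued", "Team wins the final", "Heatwave continues"]

print(categorize_batch(ROWS, CATEGORIES, EXAMPLES, fake_llm))
```

A caller could catch the `ValueError` and fall back to sending rows one at a time when a model does not follow the batch format.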

There are some similarities to the Instructor repo, but this is simpler and made for datasets. Would love any feedback/suggestions (and a star if you like it!)




Oh I'm interested to see how your batch prompt works. I've used the idea for a while and feel that it's very underrated.


Looks handy. How reliable would you say it is?


Interested in knowing this as well


GPT-4 and Claude models work great, but they cost money. Some users were very interested in running this on Ollama, but that didn't work very well for any of the batch methods.

ExtractTransformer looks like it has a lot of potential. Going to try this out tomorrow at $DAYJOB.


Excellent, thanks for sharing! Will definitely give it a shot!



