Personal computing paves the way for personal library science (bramadams.dev)
193 points by _bramses 15 days ago | 51 comments



I work in a very interdisciplinary, and somewhat niche, tech/engineering field. For the past 15 years, I've been saving every relevant PDF that I can find -- mostly studies of the sort published by Elsevier and Springer, but also books and presentations. I now have around 10k, which probably makes it the largest private library focused on this particular domain of expertise.

It has been extremely useful, especially because it's text-searchable and the really important papers are properly categorized.

A local LLM will make it 100x more useful. Also, it might not even need to be "local." If I make it available via the web, I can probably sell access to other scientists and engineers in my field.

Recent advances really benefit data hoarders out there.

I'd add that these days it totally makes sense to download libgen's entire archive, because (1) storage has never been cheaper, and (2) you can use it to train local LLMs.


> If I make it available via the web, I can probably sell access to other scientists and engineers in my field.

Out of curiosity: does this statement come from complete ignorance of, or complete disregard for, the author's copyright?


> complete ignorance of, or complete disregard for, the author's copyright

A question that could be appended to many LLM discussions!


But the LLM is no different than a human doing that & outputs COMPLETELY unique strings! - said the profiting robot parked in the legal gray area, waiting for lobbyists to turn it white.


If it's research it very probably is at least partially publicly funded. Regardless of whatever the law says, I don't think it's immoral to take it and offer better services around it that will be useful enough that someone decides to pay.


Do you not see the hypocrisy in stating that someone should be able to take something partially publicly funded and profit from it while the creator of said work should not retain some rights over said profit? By extension of the transitive property of the nebulous "partially", the LLM wrapper should be provided for free with complete disregard to the wrapper's creator since it is a derivative of partially publicly funded work.


Rightly or wrongly, Elsevier has a market capitalisation of £62B for essentially doing what GP is proposing.


The hoarders of IP in this case usually have done 0 work to produce it (Elsevier et al).


Yeah, the LLM maybe - though nobody paid for the training costs in that case, and that feels weird. If the public paid for a work, people should be able to use it without being required to give away their own derivative works for free, since they already paid for it through taxes.

While I rather like the idea of having to provide access to derivatives of publicly funded works, I fear that people would rather not use it than invest money into innovative approaches of using it. Of course, if the public pays for the training and development costs, then by all means it should be available.

And the library itself and the computing resources to operate it cost money that someone needs to pay. Publishers didn't pay for the research and yet they can profit from it - why shouldn't this guy?


I see you have encountered the first reality of "personal library".

Of course, the idea of selling access to your private digital collection (or a derivative model) is relatively more absurd than the idea of monopolizing the original published work... Even so, this is as good a time as any to reconsider the practicality of copyright.


OpenAI and the like seem to be doing just fine though.


> I'd add that these days it totally makes sense to download libgen's entire archive, because (1) storage has never been cheaper, and (2) you can use it to train local LLMs.

Hardly. Data hoarding comes with most downsides of hoarding physical objects. It's just smaller, cheaper & easier to process.

There are people who get rid of any physical object they haven't used in the past year (or 2, or 5, whatever). This makes sense. Imho, every object you own falls into 3 categories:

1) Things you use on a (semi?) regular basis. They make your life easier/nicer.

2) Things that are valuable. For flexible metrics of what constitutes value (sentimental, nostalgia, monetary, insurance against 'disaster', ...)

Not having used something in a long time is a good hint it's not valuable.

3) Luggage. Whose value is negative. It doesn't provide anything, just takes up space, drains mental energy (and possibly other resources), and in doing so gets in the way of other pursuits.

Data is no different. For any single piece of it, you either use it from time to time, somehow derive value from it, or it is useless luggage that you drag around at a cost.

Apply good judgement in what to hoard.


> Data hoarding comes with most downsides of hoarding physical objects. It's just smaller, cheaper & easier to process.

So the same downsides, except ... much better?

> There are people who get rid of any physical object they haven't used in the past year (or 2, or 5, whatever)

> Data is no different. ... you either use it from time to time

I suppose the real question then is what timeframe counts as "using it from time to time"? It seems to me that this depends very much on the person, and likely on the objects themselves - a large appliance you haven't used in a year is probably likelier to be thrown away than a small item you haven't used in several. Considering data storage is smaller, cheaper, and easier to manage, I suppose a reasonable timeframe for keeping data, just in case, would be a lot longer than for physical items.

This of course says nothing of the societal and cultural value such archivists safeguard.


I just keep everything I've bothered to keep (data and stuff). I put it in a sensible place and it only consumes time when looking for it.

The new external drive is so much larger than the oldest ones that I put the new stuff on it and use the rest of the space to back up the old drives.

Friends are constantly purging stuff, but they seem unaware how much time and effort it takes.

The significant other wanted to clean up her old photos rather than upgrade the iCloud storage. It's an insane amount of work.


The data hoarder community would encourage you to release that collection for free, not try to profit from it.


> A local LLM will make it 100x more useful. Also, it might not even need to be "local." If I make it available via the web, I can probably sell access to other scientists and engineers in my field.

That's not legal. Just because you own some files does not mean that you own the IP of the content within.


It is, however, the right thing to do.


Profiting off it is not the right thing to do.


Correct. Once you go communist you can't go back. Otherwise you're just a thief.

That’s OpenAI’s business model tho


How would you go about training an LLM on 10k large PDFs?


> The "during" is hard work, and very lonely work. There are no promises of success, and indeed, the path is one where you can't see more than three feet ahead of you and you exist on the cliff's edge of extinction by any silly mishap. The work of "during" is exhausting, and it constantly holds you taut and alert, afraid of the shadows that lurk beyond the campfire's edge.

Well said. All anyone can do is the lonely work until you can't anymore, or until you find friends so the work isn't so lonely anymore.


There may be an important divergence implied by this essay. People here ask about using an LLM, but the essay refers to "different photographs of the same scene from different photographers" or other personal collection items that are related but subjective or non-authoritative.

There is a rush in public to condense and summarize many authoritative publications to find patterns, or to replace a human expert with automated results... yet that is fundamentally different from taking multiple incomplete perspectives to add to a human library-owner's knowledge and investigations.

It is subtle to state but not subtle in its implications... taking "data as facts" and condensing them or reordering them or rewriting an output based on them, using automation, is different from a human mind taking in many inputs for its own knowledge and enabling new outputs from a human author.


> There is a rush in public to condense and summarize many authoritative publications to find patterns, or to replace a human expert with automated results... yet that is fundamentally different from taking multiple incomplete perspectives to add to a human library-owner's knowledge and investigations. It is subtle to state but not subtle in its implications... taking "data as facts" and condensing them or reordering them or rewriting an output based on them, using automation, is different from a human mind taking in many inputs for its own knowledge and enabling new outputs from a human author.

You nailed it! Thanks for noticing the divergence!


There's lots of interesting work that came out of BCL in the 1960s, https://en.wikipedia.org/wiki/Biological_Computer_Laboratory

> The focus of research at BCL was systems theory and specifically the area of self-organizing systems, bionics, and bio-inspired computing; that is, analyzing, formalizing, and implementing biological processes using computers. BCL was inspired by the ideas of Warren McCulloch and the Macy Conferences, as well as many other thinkers in the field of cybernetics.

On cybernetics, https://www.pangaro.com/definition-cybernetics.html

> Artificial Intelligence (AI) grew from a desire to make computers smart, whether smart like humans or just smart in some other way. Cybernetics grew from a desire to understand and build systems that can achieve goals.. it connects control (actions taken in hope of achieving goals) with communication (connection and information flow between the actor and the environment).. Later, Gordon Pask offered conversation as the core interaction of systems that have goals.


I've had related ideas lurking at the back of my mind for a while now. Essentially, I want to save more things locally and interact with them. For example, I have a bunch of book notes stored in Bear. I'd like to be able to ask questions about those notes, and also show the pages of the book itself.


Try Zenfetch. It's designed for this use case.


Thanks for mentioning Zenfetch :)

Happy to answer any questions


What is Zenfetch?


Personal RAG. Connect your existing bookmarks/web browsing/notes into a knowledge library with AI search and chat on top of it.


Personal computing has stagnated for such a long time that it creates substantial uncertainty about what state it might evolve to if and when the next step actually happens.

In this respect local LLM's are simply the tip of the iceberg, pointing out the vast amount of personal information processing that is available in principle but does not actually happen.


One could argue that personal computing (desktop) software piracy led to web-based SaaS subscription licensing. In theory, mobile app stores solved device software piracy, at the cost of high distribution fees, policy restrictions and telemetry.

Thanks to Linux being used at scale in Android and WSL, it's now maintained and capable on the desktop, as a hypothetical foundation for personal computing innovation. But even there, native GUI toolkits took a backseat to web and CLI. Remember Chandler? http://www.osafoundation.org/

Investors poured small fortunes into cauldrons of smart devices, wearables and AR/VR, with little to show as nascent ecosystems failed to achieve escape velocity, due to closed hardware and software that forestalled the experimentation which birthed personal computing.

Apple Silicon has reinvigorated walled laptops. Hopefully next month's derivative Qualcomm SoC from PC OEMs can offer good price/performance/watt for Apple-competitive-yet-open Arm laptops and tablets that can run any Linux distro, with retail SSDs and RAM, plus AI silicon roadmap.

A modular Framework Arm laptop would be a good start to rebooting PC innovation.


How does slightly improved laptop hardware relate to re-invigorating desktop software? Surely desktop computing has stagnated because most users are primarily or exclusively mobile users.

In Mac land Apple has been progressively dumbing down their interfaces; in Windows land Microsoft is more focused on extracting maximum value from their users than trying to meaningfully improve their platform. In Linux land there are some interesting things happening with Nix/Guix around declarative system configurations, around Fedora with its layered images+Flatpak distros for making systems more reliable, and System76 may be doing something novel interface-wise with Cosmic, marrying powerful tiling/tabbing window layouts with intuitive controls and the niceties of an all-in-one desktop environment.

From my perspective desktop computing is definitely advancing, but only for hobbyists, not for mainstream desktop operating systems.


> How does slightly improved laptop hardware relate to re-invigorating desktop software?

If Arm SystemReady laptops with good performance/watt have an open security foundation (declarative, immutable OS at EL2) to support multiple competing "app store" equivalents on Linux, the resulting revenue and competitive market can reward innovative desktop software - open, closed or hybrid. Without an Apple tax on storage and memory, funds can be redirected to a competitive market of smaller ISVs.


> the resulting revenue and competitive market can reward innovative desktop software - open, closed or hybrid.

Outside of Steam, not a single software distributor (including Canonical) has been able to do this. Linux succeeds in spite of everything you mentioned and none of it would particularly enable the sort of experience you're describing.


Unfortunately, so far Qualcomm has been hard at work avoiding making an open-platform ARM-based laptop/tablet - to the point of squeezing around MS rules through special drivers etc. to make Windows think it's dealing with EFI-compliant hardware.


That's disappointing, since mainline Linux support has made progress, https://www.linaro.org/blog/qualcomm-and-linaro-enable-lates...


The main issue is that what's upstreamed is essentially drivers for the SoCs - but the firmware of Qualcomm-powered laptops tends to be not fully compliant.

So it's easy to make a device powered by one of these SoCs when you control how Linux is booted on it, but for whatever reason things like the EFI NVRAM interface on Windows Qualcomm-powered laptops were done in a non-standard way, and the only reason Windows works is that drivers are shipped which work around it. I seriously doubt this is intended by Microsoft, because Microsoft actually benefits from devices following their official, documented hardware-interface specs - it makes for easy upgrades, reinstalls, etc.


> Personal Library Science is the leverage of LLM technology, applied to a personal library. A personal library differs from an impersonal library in the fact that a personal library is an interpretation of a source material. These interpretations include: photographs from different photographers at the same event, or favorite scenes from a movie, or favorite passages from books, parts of songs that bring you to tears, etc. Importantly, these interpretations create unique sets that go on to create unique problems which require unique, idiosyncratic solutions.

Would an LLM-driven "Personal Library" require manually annotated textual interpretation of each curated item, or could it derive personal interpretations from user history and the uniqueness of curated items/sets?

For those who have been using local, offline LLMs with a manually curated text/image corpus, what have been the most valuable or surprising use cases?

Author demo video (2023), https://youtube.com/watch?v=7TgqMRz2r3M & tooling comment (2024), https://news.ycombinator.com/item?id=39789712

> Inspired by the commonplace book format, I take highlights from Kindle and embed them in a DB. From there I build (multiple) downstream apps but the central one, Commonplace Bot is a bot that serves as a retrieval and transformer for said highlights.

Related: https://en.wikipedia.org/wiki/Lifelog


> Would an LLM-driven "Personal Library" require manually annotated textual interpretation of each curated item, or could it derive personal interpretations from user history and the uniqueness of curated items/sets?

I’ve personally found that tagging is less robust than LLM embeddings (mainly due to dimensionality), but human appended thoughts about a source — also embedded — serve even better as tags.

Example quote: "this is a quote about dinosaurs…"

(Old way of doing things)
Tags: dinosaurs, jurassic, history
Query: "dinosaurs" > results = 1

(New way of doing things)
Embedded Quote: [0.182…]
User Added Thought: "this dinosaur reminds me of a time i went to six flags with my cousins and…"
Embedded User Added Thought: [0.284…]
Query: "dinosaurs" > results = 2 (indexes = sources, thoughts)

The "thoughts" index can do a second layer cosine similarity search and serve as a tag on its own to fetch similar concepts. Basically a tree search created by similarity from user input/feedback loops.
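
A minimal sketch of that two-index setup (not my exact pipeline), assuming sentence-transformers for the embeddings; the model name and sample data are illustrative:

    # Embed source quotes and user-added thoughts separately, then search
    # both indexes with cosine similarity.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

    sources = ["this is a quote about dinosaurs..."]
    thoughts = ["this dinosaur reminds me of a time I went to six flags with my cousins..."]

    source_vecs = model.encode(sources)    # one vector per source quote
    thought_vecs = model.encode(thoughts)  # one vector per user-added thought

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def search(query, k=5):
        qv = model.encode([query])[0]
        hits = [(cosine(qv, v), "source", i) for i, v in enumerate(source_vecs)]
        hits += [(cosine(qv, v), "thought", i) for i, v in enumerate(thought_vecs)]
        return sorted(hits, reverse=True)[:k]

    # "dinosaurs" now hits both indexes; a hit in the thoughts index can seed
    # a second similarity search over sources (the "tree search" above).
    print(search("dinosaurs", k=2))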


> Would an LLM-driven "Personal Library" require manually annotated textual interpretation of each curated item

No. In something like this you’d probably have the LLM annotate and curate your personal library for you.

Potentially by creating and assigning tags or topics based on the content of your library.
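
A minimal sketch of what that annotation pass could look like, assuming an OpenAI-compatible client; the model name and prompt are placeholders:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def suggest_tags(text: str) -> list[str]:
        """Ask the model for a handful of topic tags for one library item."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; any capable model works
            messages=[
                {"role": "system",
                 "content": "Return 3-5 short topic tags for the following text, comma-separated."},
                {"role": "user", "content": text[:4000]},  # truncate long items
            ],
        )
        return [t.strip() for t in resp.choices[0].message.content.split(",")]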


Yea, I had a discussion about this not too long ago. I'd love to have a combination of a library (Personal Knowledge Management style), data ingestion, and a current world view/state.

The PKM is the stored info to write to and query against (both for LLMs and humans). The data ingests are just a pipeline of digital inputs to the system, like chat logs, maybe (transcribed) webcam feeds, files I'm currently editing on desktop, browsing history, etc. The current world view is the interpretation of what I'm doing - to tie all the ingests together and give them context. E.g. in isolation, browsing some Rust crates might not be that useful. But if I'm also editing Project X on my computer then it's reasonable to assume the searching is related to X. However, if it's been 8 hours since any Project X activity, it's less likely related. Same goes for context-less chat logs (as happens frequently in my house) where they are extensions of a voice conversation, etc.

All of this stuff is of course insanely privacy-invading, so I'd only implement this locally. I also wouldn't even store most of it for fear of data invasion, but using it to fuel a PKM automatically seems pretty sexy. Like browser history, but for your life.

This is all just wishful thinking though; LLMs have been moving too fast for me to even bother toying with this. I should note though that I did not intend for LLMs to be "smart". Rather, in a RAG-like fashion (I think that's the term), I want to just let LLMs do what they're good at - summarization & autocomplete - and let the world view / PKM store the real data.
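
A toy sketch of that "current world view" piece under the 8-hour assumption above; the event shape and helper are made up for illustration:

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class Event:
        kind: str        # "edit", "browse", "chat", ...
        subject: str     # project name, URL, message text, ...
        when: datetime

    RECENCY = timedelta(hours=8)  # beyond this, assume activity is unrelated

    def guess_context(event: Event, history: list[Event]) -> str | None:
        """Attach an ingested event to the most recently edited project, if any."""
        edits = [e for e in history
                 if e.kind == "edit" and timedelta(0) <= event.when - e.when < RECENCY]
        return max(edits, key=lambda e: e.when).subject if edits else None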


FWIW LLMs have been advancing on benchmarks but the practical usage of them (RAG, ReAct, CoT, etc.) hasn’t really changed much in the past year.


I imagine an LLM could work well for doing autocomplete while saving and annotating documents. But it’s not personal unless you edit the result to say what you want to say.


Years ago I spent thousands of hours trying to figure out how to organize a digital library.

My final answer was to use the Library of Congress catalog system. They need to add some sub-categories for how-to explanations.

Then have a field for media type (video vs. PDF vs. image)

Then note the style of presentation (academic vs. folksy vs. a manual vs. a dad showing you how to do this)

Then note the language
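
Put together, one hypothetical way to capture that scheme as a record (the field values are examples, not part of any standard):

    from dataclasses import dataclass

    @dataclass
    class LibraryItem:
        lcc_class: str    # Library of Congress classification, e.g. "TJ1189"
        subcategory: str  # custom how-to sub-category
        media_type: str   # "video", "pdf", or "image"
        style: str        # "academic", "folksy", "manual", "dad showing you"
        language: str     # e.g. "en"
        path: str         # where the file actually lives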


Not too long ago, I managed to pretty much ruin the wiki for a small (and at that time open-source) CNC machine by using it as my personal notebook --- my using it that way was probably a big part of why it was left off-line when the person hosting it moved.

You can see it on the Wayback Machine:

https://web.archive.org/web/20211127090321/https://wiki.shap...

In retrospect, I should have put some of that effort into:

https://en.wikibooks.org/wiki/Hobbyist_CNC_Machining

although since then, a machine owner worked up:

https://shapeokoenthusiasts.gitbook.io/shapeoko-cnc-a-to-z

I still regret a bunch of stuff I didn't keep copies of, esp. the scans of Barry Hughart's notes for his novels.

The irony is that one can see a bit of the result of discussion of this sort of thing at the top of one's browser window --- the URL bar, where URL == "Uniform Resource Locator" --- the originally proposed term was "Universal Resource Locator", but the argument against that was that people were not librarians, and that unlike Ted Nelson's Xanadu, there wouldn't be an over-arching data structure and organization, so a given document wouldn't have a single canonical location.

Anyone interested in this sort of thing who hasn't read it should read Tim Berners-Lee's book:

https://www.w3.org/People/Berners-Lee/Weaving/Overview.html


Best quote from the article: "...personal library science is focused on your relationship with your information. How do we store information so that it is useful at a later date? How do we transform our information into new valuable assets in different creative domains? How do we do all of this while being flexible enough for the idiosyncrasies, proclivities, likes and dislikes of eight billion distinct individuals? How do we chronicle the information diet of a single person as they learn new things, interact with the world at different phases in their life? How do we make sure we can pass down our best knowledge to generations below?"



For me, on macOS, FoxTrot Professional has been my personal file data indexer. Its essential go-to feature for me is sophisticated searching, including a form of regex. Also, true wildcards, no stopwords, and proximity searches put it far out in front of anything else I have tried, including many recent LLM local-docs tools. I have millions of files in dozens of formats in hundreds of GB gathered over decades (many digitized by me) and it handles it all like a champ, though an SSD and a late-model Mac are a must at that size. And backups, cause ain't nobody want to lose that.


As one may remember, NeXT included all sorts of non-computer documents and literature. The idea of storing vast amounts of data for personal use was there at the dawn of the PC era.


I miss Librarian.app --- it was quite useful for a project of mine:

https://tug.org/TUGboat/Articles/tb24-2/tb77adams.pdf

(basically it used copies of _The Bible_ and _The Works of Shakespeare_ to determine whether a given set of letters appeared in the English language or not)



