Hacker News
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computers (os-world.github.io)
77 points by kristianpaul 15 days ago | 39 comments



Thank you for making this available!

Check out https://github.com/OpenAdaptAI/OpenAdapt for a cross platform (Mac and Windows) open source library that learns to perform tasks in desktop apps by observing human demonstrations.

We believe a major shortcoming of conventional approaches to AI agents is expecting them to figure out tasks of arbitrary complexity from first principles. While understandable from an academic perspective, this is unnecessary for practical utility, since humans perform these tasks constantly.

With OpenAdapt you can demonstrate to a model how to perform a task, then have it take over the task, with additional user-supplied natural language instructions.

I have created an issue to evaluate OpenAdapt on OSWorld: https://github.com/OpenAdaptAI/OpenAdapt/issues/642. Contributions welcome!

Edit: from https://github.com/xlang-ai/OSWorld/tree/main/evaluation_exa...:

> The ./trajectories file contains the annotated trajectories for each data item in ./examples for finishing the task.

Unfortunately this file does not appear to be included in the repo. I have submitted an issue here: https://github.com/xlang-ai/OSWorld/issues/30


Buried in their presentation is the current effectiveness of agents to complete desktop computing tasks.

Humans are able to complete the tasks given at 70%+ effectiveness while the best model is at 12% (GPT4-v). Most of the other models were <5% effective.


I am not an AGI stan, but it is clear to me that this gap will close substantially, and quickly ... like 6 months or so. The break-neck speed of development and (to me) the counterintuitive zeal with which open source developers are just donating tools and code that are going to hamstring them somewhat pretty soon is breathtaking. Think about how "most" companies use tech. Most companies are not big, they have a minimum set of tools to get what they need done. They currently, and routinely, have human mistakes. These are the types of tasks that agents trained by human demonstration can do soon. The only one that I suspect will be slow for adoption is bookkeeping, because you could theoretically end up on the wrong side of the IRS with mistakes.


This is a thoughtful article on the current limitations of LLMs and agents: https://www.strangeloopcanon.com/p/what-can-llms-never-do

A good example is that you can teach a first grader how to play wordle, but an LLM can’t be taught.

That’s a pretty big gap to a situation where most humans can be replaced.


Except that the problems most (and by that I mean non-software, non-billion-dollar) businesses have are not wordle type puzzle problems. They are 'go through all of our invoices and put xxxx data in a spreadsheet for me please' type problems or 'copy the data from this app and put it in that app' type problems or responding to emails or responding to simple customer support questions. And these are definitely in the wheelhouse of LLM near-future capabilities.


And the human performers of many of these tasks have been replaced not by AIs but by VBA scripts. It's almost a trope that many Hacker News readers' first jobs were something of this type, and on the first day they wrote a VBA script to do most or all of it and then collected pay for, effectively, playing Minesweeper all day.


This is true, and how I got into software development as a job, but going forward they will likely be replaced by AI agents off of some site like Zapier.


I think this is a clear case where “human-level” is a totally meaningless expression, since you have to ask about which humans[1]. On one hand, these were CS undergrads, so naively I would expect their score to be at least a bit higher than the white-collar worker average. On the other hand, these undergrads weren’t actually familiar with most of the software being tested - surely less familiar than GPT-4! And on the other-other hand I would expect an actual IT professional to do better. There is a difficult conflation of subject-matter knowledge and actual intelligence which makes “human-level” especially problematic. “CS undergrad-level” doesn’t roll off the tongue, but “human-level” seems downright misleading.

Nevertheless, this is a striking result, indicating categorical weakness in out-of-distribution reasoning. And I think these tasks being so familiar - real human problem-solving where your ability to think makes up for gaps in knowledge - helps illustrate the problem with transformers: they’re just not smart enough.

[1] I get very annoyed with discussions about “AGI” for systems that are totally incapable of solving problems that hunter-gatherers need to solve. But that’s another rant.


Give it time. The technology is a year old.


Gotta love people working on replacing themselves. Jokes aside, seeing an AI interacting with a computer is kind of scary. It's not just outputting text anymore, it's doing the full work of a human working on a computer, meaning... a ton of people


> Gotta love people working on replacing themselves.

So here's the thing: I want the computer to replace me... with regards to things that I don't want to do, or that aren't a good use of my time (boring, repetitive, trite, tedious, etc.). BUT the caveat is, I want it to be in a context where I'm in control of the computer and own the output, etc.

Now companies and other actors also want the computer to replace me, but they want to be in control and to own the output. But here's the rub: they're not going to stop trying to come up with ways to replace me and minimize my value just because I don't like it. No man can stop the tide from coming in, or something like that (insert pithy saying here?). But if you know the tide is coming, you can retreat further inland OR build a boat, or climb a pole, etc.

So where I'm at with this is, the more I work on AI, the more I know about AI, the more I can use it to enhance my own value outside of the context of corporate controlled IT systems, the more I can exercise a (possibly insignificant, but maybe not) degree of control over my fate. I can try to build my own boat, or at least outrun the tide for a little while.

I don't know how things are going to turn out, and there's a reason that I am not in fact joking when I crack the occasional remark about "after this conversation I'm going to run to Bass Pro Shops and pick up a few more boxes of ammo" or whatever. Maybe a cyberpunk dystopia is what awaits us all. Maybe not. I'm just going to try to make my best guess at what the possible / likely future scenarios are and take whatever actions I can now to situate myself as favorably as I can.


We're all pretending we can predict what the future looks like. We can't; we can't juggle that many variables. It seems the most we can imagine is the world just as it is today, but with 1 or 2 rich guys commanding an army of AI systems to pamper them while we're all out in some barren wasteland starving and being culled by robots. At least that seems to be what's causing the anxiety.

If you think about it, it's one of the simplest things you could extrapolate from "THE RISE OF AI!", but as if reality is that simple? I think it shows more about our natural tendency to think catastrophically.


"Hope for the best, plan for the worst."


There has to be a practical limit to the amount of energy we give something.


I don't disagree. And don't mistake me, I don't sit around spending all day worrying about the impending collapse of society, or watching repeats of Doomsday Preppers or whatever. :-)

But I do spend some time thinking about possible future scenarios and trying to figure out if there are any (reasonable, for some value of "reasonable") actions I can take today to help "future me" deal with certain eventualities that might come to pass.


> "after this conversation I'm going to run to Bass Pro Shops and pick up a few more boxes of ammo" or whatever. Maybe a cyberpunk dystopia is what awaits us all.

I don't mean this to sound accusatory, but do you think there's at all a connection between the probability of a dystopic future and the prevalence of people betting that that's the future?


It's a fair question. Honestly, I don't know. Will give it some more thought.


I appreciate that.

I certainly can't blame anyone for seeing the state of the world and wanting to prepare for the worst - especially those with a family - and I'm not suggesting one just ignore that reality and become hopelessly naive about the future.

I just... I don't know. It's hard to see everyone preparing for the worst and not feel like it's part of the reason that reality feels so likely. I hope I'm wrong. I hope we're both wrong. :P

Anyway I do appreciate your response and not just jumping to the conclusion I'm a silly optimist.


It’s peculiar to me that our lives have become so shit and uninspiring that our biggest fear is not being able to sit in front of the computer anymore. If the computer replaces me and I starve to death, well, that’s tough, but I’m sure as shit not going to worry about office life being preserved for eternity.


Their fear isn’t “not being able to sit in front of the computer anymore.” That should help explain the peculiarity.


So what is the fear then?

I truly think a lot of people think their purpose is to be some type of processing unit for the economic machine and if that's taken away, they'll just dissolve into some type of depressive goop.


You’re the one who said “starve to death” in your original comment! Maybe that’s a hint as to what the fear is: mass unemployment without universal basic income.

This is one of the most bizarrely bad-faith comments I’ve ever seen.


Fair enough, but I meant why have anxiety about it? The anxiety will do nothing to solve it.

I don't imagine a more efficient world will somehow create starvation? It's just not worth worrying about, because if that's how it's going to go, that's how it's going to go.

I personally think the destruction of the personal compute age will be one of the greatest things that's ever happened to us.

Ever dread having to sit at a computer desk pecking away at a keyboard on a beautiful sunny day? Well, maybe that going away is what will happen, and it will be glorious. I'd honestly prefer to be tending to my tomato plants and hiking in the forest and drinking tea on a mountain peak.


I would guess it's job loss, which subsequently affects food, shelter, those needs for their dependents, etc.


It will change pretty fast if everyone loses their job. We've created this reality, honestly.

We've had the technology for a very long time to easily provide food, clothing and shelter for everyone probably 10x over. It just doesn't happen because > 50% of people are doing ok.

In other words, we mostly play the game that suits us.


It's not going to be pretty fast. It will be gradual. 0.5% employment loss this year. 1% loss the next year. 1.2% the year after.

Ten years later the middle class is almost non-existent. Unemployment is 25% and pundits are calling it "the new normal". You and your wife and your two kids now live with her parents. You share a three bedroom in the suburbs.

Every morning you both leave early to stand amongst fifty others outside the local Amazon Mechanical Turk offices. Once a software developer with a thriving career, you're now just another face in the crowd, hoping for a day's work training AI — your deep knowledge reduced to manual inputs for algorithms you once could have designed.

The daily scramble for work, even with your computer science degree, has become a grim routine. The young woman with the tablet doesn’t even meet your eyes as she checks you in. "You're training AI for advanced behavioral prediction in smart home systems."

Your wife misses out. Her PhD in physics is no help.

There's still some good jobs out there, but the supply vastly outstrips demand. Your kids witness the daily tension and uncertainty that you and your wife face, the scramble for sporadic employment, the diminishing returns on your education and experience. This environment is their new normal; they adapt, but at a cost. The stress and instability shape their expectations from life, narrowing their dreams to mere thoughts of making it through each day. They have no idea how great the 90s truly were.

This generational shift means that the skills, values, and aspirations that once propelled the middle class are forgotten or dismissed as irrelevant.

As the middle class dissolved under the relentless march of AI and automation, the notion of Universal Basic Income (UBI) was quickly dismissed as "unrealistic" and "impossible" by policymakers. Tax rates for the capital-rich continued to plummet, entrenching wealth in the hands of a few. This stark contradiction highlighted a society deeply invested in maintaining the status quo, prioritizing wealth accumulation at the top while the concept of a basic, stabilizing income for the displaced masses was easily cast aside as an unfeasible dream.

This grim future is not merely possible but an impending reality on our current trajectory, where technological advancements are harnessed to consolidate wealth at the top, and equitable solutions like UBI are dismissed as fantasies. The future promises a deepening divide and escalating societal strife.

The idea of a sudden, dramatic loss of jobs sparking immediate and widespread unrest underestimates the more insidious nature of technological displacement. This process is neither swift nor explosive; rather, it unfolds gradually, almost imperceptibly, eroding job security and living standards slowly over years. By the time the consequences fully manifest (soaring unemployment and entrenched wealth inequality) it feels too late to act.


> Every morning you both leave early to stand amongst fifty others outside the local Amazon Mechanical Turk offices. Once a software developer with a thriving career, you're now just another face in the crowd, hoping for a day's work training AI, your deep knowledge reduced to manual inputs for algorithms you once could have designed.

I don't know if you're trolling or not but this is an absurd take on what an AI infused future would look like


Just a modern version of the "shape-ups" or hiring lines from the Great Depression. Just as workers lined up each day back then for a chance at temporary jobs, the scenario I described with people queuing at Amazon Mechanical Turk offices mirrors that.

Of course you wouldn't literally line up at a building, you'd just go and sign up from the comfort of your old laptop on the dining room table: https://www.mturk.com/get-started

The reality is the "gig economy" is the digitally disrupted version of job lines. We're already there.


I mean look how great things have been since then, since the great depression?

If we're already in it, we're already one step closer to the post depression era utopia for the 30s and 40s.

You've made a strange comparison though because the great depression had little to do with technological advancement?


It wasn’t that deep a comparison. I just liked the imagery of people lining up for work outside “Mechanical Turk Centers”.

It’s more a nod to the 25% unemployment of the Great Depression.


I built a small Python script so I could let GPT-4 debug my system issues:

https://github.com/skorokithakis/sysaidmin

It works surprisingly well!


What’s the most complicated thing it was able to debug?


I had a library missing, and it started using ldd to see where things were linked, then changed the library path to the proper location. It was impressive.
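
For anyone curious what that kind of check looks like, here is a minimal sketch. It is not taken from sysaidmin; the function names and the sample output are made up for illustration. It assumes a Linux system where ldd marks unresolved dependencies with "=> not found".

```python
import subprocess


def missing_libraries(ldd_output: str) -> list[str]:
    """Return the names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved dependencies appear as: "libfoo.so.1 => not found"
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


def check_binary(path: str) -> list[str]:
    """Run ldd on a binary (Linux only) and list its unresolved libraries."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)


# Hypothetical ldd output, for illustration only.
SAMPLE = """\
\tlinux-vdso.so.1 (0x00007ffd3a5f2000)
\tlibfoo.so.1 => not found
\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c2a000000)
"""

print(missing_libraries(SAMPLE))
```

Once you know which library is unresolved, one common fix is to point LD_LIBRARY_PATH at the directory that actually contains it, which is roughly the kind of change described above.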


Sounds like the state of your computer is the real problem here.


As opposed to?


https://nixos.org/

But just in general, I've been a Linux pro for about 15 years, it's extremely rare I'd have to manually link anything?


Well, when you need to run software that needs a library and your system doesn't have it because the library is too old, what do you do?


Coming soon: Human-trained AI that can actuate a robotic hand to fill in paper forms with a Selectric typewriter. The doom of us all!


Dumb question - What actually needs to be done to close the gap?



