OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
At a glance.
A compact read before the deeper capability notes and official setup links.
Core features.
Feature cards focus on what the tool helps users do, not generated setup commands.
To run the baseline agent used in our paper, the official setup docs give an example command for the GPT-4o pure-screenshot setting.
Parallel execution is also supported; the official example shows switching the provider to Docker.
The results, which include screenshots, actions, and video recordings of the agent's task completion, are saved in the ./results directory (or whichever result_dir you specified).
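As a rough illustration of inspecting such a directory after a run, the sketch below counts the files under a results folder by kind. The helper name summarize_results and the extension-to-kind mapping are assumptions made for this example only; the actual output layout and file types are defined by the official project and should be checked there.

```python
from collections import Counter
from pathlib import Path

# Hypothetical mapping from file extension to artifact kind. This is an
# assumption for illustration; the real output layout may differ.
KIND_BY_SUFFIX = {
    ".png": "screenshot",
    ".jpg": "screenshot",
    ".mp4": "recording",
    ".json": "log",
    ".jsonl": "log",
    ".txt": "log",
}


def summarize_results(result_dir: str = "./results") -> Counter:
    """Count files under result_dir, grouped by assumed artifact kind."""
    counts: Counter = Counter()
    root = Path(result_dir)
    if not root.is_dir():
        return counts  # directory missing: nothing has been written yet
    for path in root.rglob("*"):
        if path.is_file():
            kind = KIND_BY_SUFFIX.get(path.suffix.lower(), "other")
            counts[kind] += 1
    return counts


if __name__ == "__main__":
    for kind, n in sorted(summarize_results().items()):
        print(f"{kind}: {n}")
```

Run as a script, it prints one line per artifact kind found under ./results; pass a different path to summarize_results if you chose another result directory.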
For manual verification and examination of specific benchmark tasks, you can use the manual examination tool.
To evaluate your own agent, please start by reading through the agent interface and the environment interface.
Correctly implement the agent interface and import your customized version in run.py (for single-threaded execution) or in scripts/python/run_multienv.py / scripts/python/run_multienv_xxx.py (for parallel execution).
Afterward, you can execute a command similar to the one in the previous section to run the benchmark on your agent.
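The general shape of this workflow can be sketched as follows. Everything in the block is hypothetical: the class names EchoAgent and StubEnv, the reset/predict/step method names, and the "WAIT"/"DONE" action strings are assumptions chosen for illustration, and StubEnv is a stand-in rather than the real environment. The actual contract is whatever the official agent and environment interfaces define.

```python
from typing import Any, Dict, List, Tuple


class EchoAgent:
    """Hypothetical custom agent; method names are assumed, not confirmed."""

    def __init__(self, max_steps: int = 3) -> None:
        self.max_steps = max_steps
        self.history: List[str] = []

    def reset(self) -> None:
        # Clear per-task state before a new episode.
        self.history.clear()

    def predict(self, instruction: str, obs: Dict[str, Any]) -> Tuple[str, List[str]]:
        # A real agent would send the observation to a model here.
        step = len(self.history) + 1
        action = "DONE" if step >= self.max_steps else "WAIT"
        self.history.append(action)
        return f"step {step}: {action} for '{instruction}'", [action]


class StubEnv:
    """Stand-in environment so the sketch runs without a virtual machine."""

    def reset(self) -> Dict[str, Any]:
        return {"screenshot": b""}

    def step(self, action: str) -> Tuple[Dict[str, Any], float, bool, Dict[str, Any]]:
        # Returns (observation, reward, done, info); done once "DONE" is seen.
        return {"screenshot": b""}, 0.0, action == "DONE", {}


def run_episode(agent: EchoAgent, env: StubEnv, instruction: str,
                max_steps: int = 10) -> List[str]:
    """Drive one episode and return the list of actions taken."""
    agent.reset()
    obs = env.reset()
    taken: List[str] = []
    for _ in range(max_steps):
        _response, actions = agent.predict(instruction, obs)
        for action in actions:
            obs, _reward, done, _info = env.step(action)
            taken.append(action)
            if done:
                return taken
    return taken


if __name__ == "__main__":
    print(run_episode(EchoAgent(), StubEnv(), "open the file manager"))
```

Keeping the agent behind a small reset/predict surface like this means the driver loop does not need to change when the underlying model does.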
Agent / Skill / MCP / Workflow fit.
This panel keeps technical format separate from the user-facing AI category.
Official setup path.
Generated install snippets are intentionally not mirrored here because they drift. The page links to source-owned setup docs instead.
Evidence and adoption notes.
These notes help a user decide whether to investigate the official project further.
Source repository last pushed at 2026-05-19T14:12:41Z.
Generated from source metadata; confirm operational details in the official project before adopting it.
Review the upstream license, maintenance activity, and issue history before using it in production.