cli · tool profile

OSWorld

[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cli Kimi
01

At a glance.

A compact read before the deeper capability notes and official setup links.

Fit snapshot
Format CLI
Category cli
Kimi
02

Core features.

Feature cards focus on what the tool helps users do, not generated setup commands.

01

To run the baseline agent from the OSWorld paper, the project documents an example command for the GPT-4o pure-screenshot setting.

02

Parallel execution, with an example that switches the provider to Docker.

03

The results, including screenshots, actions, and video recordings of the agent's task completion, are saved in the ./results directory (or whichever result_dir you specified).
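A quick way to see what a run produced is to count the files in the results directory by type. The sketch below is only an illustration: the page does not describe the layout inside ./results, so the code assumes nothing beyond "a directory tree containing files" and the helper name summarize_results is hypothetical.

```python
from collections import Counter
from pathlib import Path


def summarize_results(result_dir: str = "./results") -> Counter:
    """Count files under result_dir, grouped by lowercase file extension.

    Assumption: the directory layout is unknown, so the whole tree is walked
    and nothing is inferred from folder or file names.
    """
    root = Path(result_dir)
    counts: Counter = Counter()
    if not root.is_dir():
        # Nothing has been written yet (or a different directory was used).
        return counts
    for path in root.rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts


if __name__ == "__main__":
    for ext, n in sorted(summarize_results().items()):
        print(f"{ext}: {n}")
```

Pass a different path to summarize_results if the run was configured to write somewhere other than ./results.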

04

For manual verification and examination of specific benchmark tasks, a manual examination tool is available.

05

To evaluate your own agent, start by reading through the agent interface and the environment interface.

06

Implement the agent interface correctly, then import your customized version in run.py (for single-threaded execution) or in scripts/python/run_multienv.py / scripts/python/run_multienv_xxx.py (for parallel execution).

07

Afterward, you can execute a command similar to the one in the previous section to run the benchmark on your agent.
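As a rough picture of what a custom agent might look like before it is wired into the run script, here is a minimal sketch. Everything in it is an assumption made for illustration: the class name MyAgent, the reset and predict methods, the observation dictionary, and the placeholder action strings are hypothetical and not taken from this page. The real contract is defined by the agent and environment interfaces in the official project.

```python
from typing import Any, Dict, List, Tuple


class MyAgent:
    """Hypothetical custom agent; method names and shapes are assumptions."""

    def __init__(self, max_steps: int = 15) -> None:
        self.max_steps = max_steps
        self.history: List[str] = []

    def reset(self) -> None:
        """Clear per-task state before a new task starts."""
        self.history.clear()

    def predict(self, instruction: str, obs: Dict[str, Any]) -> Tuple[str, List[str]]:
        """Return (reasoning text, list of actions) for one step.

        A real agent would inspect obs (for example a screenshot) and call a
        model here. This stub only waits until the step budget is used up and
        then signals completion, using placeholder action strings.
        """
        action = "DONE" if len(self.history) >= self.max_steps else "WAIT"
        self.history.append(action)
        reasoning = f"step {len(self.history)} for task: {instruction}"
        return reasoning, [action]
```

Keeping per-task state inside the agent and clearing it in reset is one simple way to reuse a single instance across many benchmark tasks. Whether the benchmark expects this exact shape should be checked against the upstream interfaces.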


04

Agent / Skill / MCP / Workflow fit.

This panel keeps technical format separate from the user-facing AI category.

Tool type CLI
Use categories cli
Works with Kimi
05

Official setup path.

Generated install snippets are intentionally not mirrored here because they drift. The page links to source-owned setup docs instead.

06

Evidence and adoption notes.

These notes help a user decide whether to investigate the official project further.

Source repository last pushed at 2026-05-19T14:12:41Z.

Generated from source metadata; confirm operational details in the official project before adopting it.

Review the upstream license, maintenance activity, and issue history before using it in production.

Trusted source

Trace the origin before adopting.