IDE ARENA

ABOUT

IDE Arena is a harness that pairs whole codebases (environments) with coding tasks, designed to evaluate and improve autonomous "IDE agents" operating within full software repositories. We define IDE agents as AI models operating in a chat-based IDE environment with access to the same tools available in agent-enabled IDEs like Cursor. Each task simulates real, multi-file development workflows — feature implementation, bug fixing, refactoring, and performance optimization — requiring agents to reason coherently across backend and frontend code, configuration files, and test suites. Unlike terminal-based datasets, IDE Arena embeds agents directly inside runnable, full-stack environments (e.g., MERN, Flask, Django, FastAPI).
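
To make the task format concrete, here is a minimal sketch of what a single task record might look like. The field names (task_id, environment, category, prompt, verify_cmd) and all values are illustrative assumptions, not IDE Arena's actual schema:

    # Hypothetical task record for an IDE Arena-style harness.
    # All field names and values are illustrative assumptions.
    task = {
        "task_id": "fastapi-log-analytics/regex-filtering",
        "environment": "fastapi-log-analytics",  # runnable repo the agent works in
        "category": "feature",  # feature | bugfix | refactor | perf
        "prompt": (
            "Add regex-based filtering to the log query endpoint so clients "
            "can restrict results to matching request paths."
        ),
        "verify_cmd": "docker compose run --rm tests",  # Dockerized grading step
    }

Grading through a single verification command per task is one way such a harness could keep evaluation reproducible across very different environments.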

Value for Model Training
• Tool calling: Trains agents to operate via structured tool interfaces (e.g., code search, read/edit file, run command), reflecting modern IDE agent workflows; a sketch of such an interface follows this list.
• Full-context reasoning: Trains models to edit and plan across 10+ interdependent files.
• Engineering realism: Mirrors real development cycles - writing, refactoring, and testing code.
• Reproducible verification: Every task includes runnable test suites, CI scripts, and Dockerized grading environments.
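
As a rough illustration of the tool-calling surface described above, the sketch below models the four example tools as a typed Python interface. The names and signatures are assumptions chosen for illustration, not IDE Arena's actual API:

    from typing import Protocol

    # Hypothetical IDE-agent tool interface; names and signatures are assumptions.
    class IDETools(Protocol):
        def code_search(self, query: str) -> list[str]:
            """Return paths (with match snippets) of files matching a query."""
            ...

        def read_file(self, path: str) -> str:
            """Return the full contents of a repository file."""
            ...

        def edit_file(self, path: str, old: str, new: str) -> None:
            """Replace an exact snippet in a file with new text."""
            ...

        def run_command(self, cmd: str) -> tuple[int, str]:
            """Run a shell command in the repo; return (exit code, output)."""
            ...

In an episode, the model issues calls against an interface like this (search, read, edit, run the tests) until the task's verification command passes.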

Dataset Sample
The sample dataset demonstrates a typical IDE Arena repository: a small FastAPI backend that ingests Nginx/HTTP access logs and computes traffic analytics. Each task modifies or extends system functionality - for instance, adding anomaly detection, configurable defaults, or regex-based log filtering. The project illustrates how agents interact with multi-file Python environments, perform structured reasoning, and ensure all Dockerized tests pass. Each task ships in two paired forms (a sketch of the pairing follows this list):
• Golden (Oracle): Complete, correct reference implementations that define the expected outputs and serve as the ground truth.
• Stubbed (Null): Incomplete or placeholder implementations used as the actual test inputs. AI models attempt to complete these stubs, and their outputs are evaluated against the golden versions.
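
The pairing might look like the following sketch for a hypothetical regex-filtering task; the function name filter_logs and its signature are invented for illustration, not taken from the real repository:

    import re

    # Stubbed (null) version: the placeholder the agent starts from.
    # Function name and signature are hypothetical, not from the real repo.
    def filter_logs(lines: list[str], pattern: str) -> list[str]:
        raise NotImplementedError  # the agent must implement this

    # Golden (oracle) version: the reference that defines expected behavior,
    # here keeping only log lines that match the given regex.
    def filter_logs_golden(lines: list[str], pattern: str) -> list[str]:
        rx = re.compile(pattern)
        return [line for line in lines if rx.search(line)]

    # The Dockerized test suite grades the agent's completion of the stub
    # against the golden behavior, e.g.:
    assert filter_logs_golden(
        ["GET /api/users 200", "GET /health 200"], r"/api/"
    ) == ["GET /api/users 200"]

Grading then reduces to running the repository's test suite against the agent-completed stub and comparing its behavior to the golden reference.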

IDE Arena is the first comprehensive benchmark designed to evaluate AI agents in the environment where developers actually use them: inside the IDE.

DASHBOARD

[Leaderboard chart: scores for Cursor, Windsurf, and Kiro on a 0-50 scale.]