Agent & Execution Modes
Understanding the DroidAgent system in DroidRun
Configuration
Execution Modes
The agent operates in three distinct modes, each optimized for different complexity levels and use cases.
Direct Execution
REASONING: LOW | SPEED: HIGH
Flow: Goal → Direct Execution → Result
Best Practices:
- Use for simple, short tasks (1-15 steps)
- Keep goals specific and atomic
- Faster execution with no planning overhead
Planning Mode
REASONING: MEDIUM | SPEED: MEDIUM
Flow: Goal → Planning → Step-by-step Execution → Result
Best Practices:
- Use for multi-step tasks (15-50 steps)
- Ideal for navigation between apps/screens
- Good for tasks requiring step-by-step breakdown
Reflection Mode
REASONING: HIGH | SPEED: LOW
Flow: Goal → Planning → Execution → Reflection → Re-planning (if needed) → Result
Best Practices:
- Use for complex workflows (50+ steps)
- Essential for quality control and verification
- Best when context preservation is critical
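The step ranges and flows for the three modes can be summarized in a small lookup. This is only an illustrative sketch in plain Python, not part of the DroidRun API: the names `Mode`, `FLOWS`, and `choose_mode` are hypothetical, and because the documented ranges overlap at 15 and 50 steps, the sketch assumes those boundary values fall into the simpler mode.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical labels for the three execution modes described above."""
    DIRECT = "direct"
    PLANNING = "planning"
    REFLECTION = "reflection"


# Flow stages for each mode, copied from the "Flow" lines above.
FLOWS = {
    Mode.DIRECT: ["Goal", "Direct Execution", "Result"],
    Mode.PLANNING: ["Goal", "Planning", "Step-by-step Execution", "Result"],
    Mode.REFLECTION: [
        "Goal",
        "Planning",
        "Execution",
        "Reflection",
        "Re-planning (if needed)",
        "Result",
    ],
}


def choose_mode(estimated_steps: int) -> Mode:
    """Pick a mode from a rough step estimate.

    Assumption: the boundary values 15 and 50 are assigned to the
    simpler mode, since the documented ranges overlap there.
    """
    if estimated_steps < 1:
        raise ValueError("estimated_steps must be at least 1")
    if estimated_steps <= 15:
        return Mode.DIRECT
    if estimated_steps <= 50:
        return Mode.PLANNING
    return Mode.REFLECTION


if __name__ == "__main__":
    for steps in (5, 30, 80):
        mode = choose_mode(steps)
        print(steps, mode.value, " -> ".join(FLOWS[mode]))
```

A step estimate is only a heuristic. A short task that needs verification may still be better served by Reflection Mode.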
Vision capabilities
Vision capabilities require an LLM with vision support (e.g., GPT-4o or Gemini-2.5-Flash) and are deactivated for the DeepSeek provider.
By default, DroidAgent operates entirely without vision by leveraging Android’s Accessibility API to extract the UI hierarchy as XML. This approach is efficient and works well for most automation tasks.
However, enabling vision capabilities allows the agent to take screenshots and visually analyze the device screen, which can be beneficial in specific scenarios:
- Content-heavy applications: When apps contain complex visual elements, images, or layouts that aren’t fully captured by the XML hierarchy
- Visual verification: For tasks requiring confirmation of visual elements or layouts
- Enhanced context understanding: When UI structure alone doesn’t provide sufficient information for decision-making
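The constraints on vision described in this section can be expressed as a simple guard. The following is a minimal sketch in plain Python, not DroidRun code: the function `should_enable_vision` and its parameters are hypothetical, and the provider is assumed to be identified by a string such as "deepseek".

```python
def should_enable_vision(
    provider: str,
    model_supports_vision: bool,
    vision_requested: bool = False,
) -> bool:
    """Decide whether screenshots should be sent to the LLM.

    Hypothetical helper mirroring the rules stated in the text:
    - vision is off by default (the UI hierarchy is used instead),
    - vision is deactivated for the DeepSeek provider,
    - vision requires a model that can process images.
    """
    if not vision_requested:
        return False
    # Assumption: provider names are compared case-insensitively.
    if provider.strip().lower() == "deepseek":
        return False
    return model_supports_vision


if __name__ == "__main__":
    print(should_enable_vision("openai", True, vision_requested=True))
    print(should_enable_vision("deepseek", True, vision_requested=True))
    print(should_enable_vision("openai", True))
```

Keeping the default at `False` matches the behavior described above, where the agent relies on the Accessibility API unless vision is explicitly turned on.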