🤖 ReAct Agent
DroidRun uses a ReAct (Reasoning + Acting) agent to control Android devices. This approach combines LLM reasoning with concrete device actions to carry out complex automation tasks.

🔍 What is ReAct?

ReAct is a framework that combines:

- Reasoning: Using an LLM to interpret tasks, make decisions, and plan steps
- Acting: Executing concrete actions on an Android device
- Observing: Getting feedback from actions to inform future reasoning
🔄 The ReAct Loop

1. Goal Setting: The user provides a natural language task like "Open settings and enable dark mode"
2. Reasoning: The LLM analyzes the task and determines what steps are needed
3. Action Selection: The agent selects an appropriate action (e.g., tapping a UI element)
4. Execution: The action is executed on the Android device
5. Observation: The agent observes the result (e.g., a new screen appears)
6. Further Reasoning: The agent evaluates progress and decides on the next action
🛠️ Available Actions

The ReAct agent can perform various actions on Android devices:

UI Interaction

- tap(index): Tap a UI element by its index
- swipe(start_x, start_y, end_x, end_y): Swipe from one point to another
- input_text(text): Type text into the current field
- press_key(keycode): Press a specific key (e.g., HOME, BACK)
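One way to picture these actions is as thin wrappers over the standard `adb shell input` commands. This mapping is illustrative, not DroidRun's actual implementation; in particular, `tap(index)` must first resolve an element index to screen coordinates from the UI tree, so a hypothetical coordinate-based `tap_xy` helper is shown instead. Commands are built but not executed here.

```python
# Illustrative mapping from the UI interaction actions to `adb shell input`
# commands. Each helper returns the argv list it would run.

def tap_xy(x, y):
    # Hypothetical coordinate tap; tap(index) would resolve an element
    # index to (x, y) via the UI tree before issuing this command.
    return ["adb", "shell", "input", "tap", str(x), str(y)]

def swipe(start_x, start_y, end_x, end_y):
    return ["adb", "shell", "input", "swipe",
            str(start_x), str(start_y), str(end_x), str(end_y)]

def input_text(text):
    return ["adb", "shell", "input", "text", text]

def press_key(keycode):
    return ["adb", "shell", "input", "keyevent", keycode]

print(" ".join(press_key("KEYCODE_HOME")))
# adb shell input keyevent KEYCODE_HOME
```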
App Management

- start_app(package): Launch an app by package name
- list_packages(): List installed packages
- install_app(apk_path): Install an app from an APK
- uninstall_app(package): Uninstall an app
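The app management actions similarly correspond to well-known `adb` commands. Again this is a sketch of the idea, not DroidRun's code; the helpers only build the argv lists.

```python
# Illustrative mapping from the app management actions to adb commands.

def install_app(apk_path):
    return ["adb", "install", apk_path]

def uninstall_app(package):
    return ["adb", "uninstall", package]

def list_packages():
    return ["adb", "shell", "pm", "list", "packages"]

def start_app(package):
    # `monkey -c ...LAUNCHER 1` starts the package's default launcher
    # activity without needing to know the activity name.
    return ["adb", "shell", "monkey", "-p", package,
            "-c", "android.intent.category.LAUNCHER", "1"]

print(" ".join(start_app("com.android.settings")))
```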
UI Analysis

- take_screenshot(): Capture the current screen (vision mode only)
- get_clickables(): Identify clickable elements on screen
- extract(filename): Save the complete UI state to a JSON file
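To make the `extract(filename)` idea concrete, here is a sketch of serializing a UI state to JSON. The element fields (`index`, `text`, `clickable`) are illustrative assumptions, not DroidRun's actual schema.

```python
# Sketch of an extract(filename)-style dump: serialize the current UI
# state to a JSON file. Field names are assumptions for illustration.
import json

def extract(filename, elements):
    state = {"element_count": len(elements), "elements": elements}
    with open(filename, "w") as f:
        json.dump(state, f, indent=2)
    return state

ui = [
    {"index": 0, "text": "Settings", "clickable": True},
    {"index": 1, "text": "Dark mode", "clickable": True},
]
state = extract("ui_state.json", ui)
print(state["element_count"])  # 2
```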
Task Management

- complete(result): Mark the task as complete with a summary
📸 Vision Capabilities

When vision mode is enabled, the ReAct agent can analyze screenshots to better understand the UI:

- Visual Context: The LLM can see exactly what's on screen
- Better UI Understanding: Recognizes UI elements even if text detection is imperfect
- Complex Navigation: Handles apps with unusual or complex interfaces more effectively
📊 Token Usage Tracking

The ReAct agent tracks token usage for all LLM interactions:

- Cost Management: Track and optimize your API usage costs
- Performance Tuning: Identify the steps that consume the most tokens
- Troubleshooting: Debug issues with prompt sizes or response lengths
🔧 Agent Parameters

When creating a ReAct agent, you can configure several parameters.

📝 Step Types
The agent records its progress using different step types:

- Thought: Internal reasoning about what to do
- Action: An action to be executed on the device
- Observation: The result of an action
- Plan: A sequence of steps to achieve the goal
- Goal: The target state to achieve
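One natural way to represent these step types is an enum plus a small record for each entry in the agent's trace. The names mirror the list above; the exact classes are illustrative, not DroidRun's internals.

```python
# Illustrative representation of the agent's step types and trace.
from enum import Enum
from dataclasses import dataclass

class StepType(Enum):
    THOUGHT = "thought"
    ACTION = "action"
    OBSERVATION = "observation"
    PLAN = "plan"
    GOAL = "goal"

@dataclass
class Step:
    type: StepType
    content: str

trace = [
    Step(StepType.GOAL, "Open settings and enable dark mode"),
    Step(StepType.THOUGHT, "I should launch the Settings app first"),
    Step(StepType.ACTION, "start_app('com.android.settings')"),
    Step(StepType.OBSERVATION, "Settings main screen is visible"),
]
print([s.type.value for s in trace])
# ['goal', 'thought', 'action', 'observation']
```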
💡 Best Practices
- Clear Goals: Provide specific, clear instructions
- Realistic Tasks: Break complex automation into manageable tasks
- Vision for Complex UIs: Enable vision mode for complex UI navigation
- Step Limits: Set reasonable max_steps to prevent infinite loops
- Device Connectivity: Ensure stable connection to your device
- Token Optimization: Monitor token usage for cost-effective automation