```python
# Multi-step tasks requiring navigation and decision-making
agent = DroidAgent(
    goal="Set up a new alarm for 7 AM with custom ringtone and label 'Work'",
    llm=llm,
    tools=tools,
    reasoning=True
)
```
```python
# Complex, multi-app workflows requiring verification and adaptive planning
agent = DroidAgent(
    goal="Find the cheapest hotel in Manhattan for next weekend, compare prices across multiple booking apps, and share the best option with my team on Slack",
    llm=llm,
    tools=tools,
    reasoning=True,
    reflection=True
)
```
Reflection is based on screenshots, so use it with a vision-capable LLM (e.g., GPT-4o or Gemini-2.5-Flash).
Vision capabilities are deactivated for the DeepSeek provider; using them requires a vision-capable LLM (e.g., GPT-4o or Gemini-2.5-Flash).
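As an illustration, here is a minimal sketch of pairing reflection with a vision-capable model. It assumes the `llm` object is provided by llama_index's OpenAI wrapper and that `tools` is created elsewhere, as in the examples above; adjust the import, model name, and goal for your own setup.

```python
# Sketch: reflection reasons over screenshots, so pass a vision-capable model.
# Assumes llama_index's OpenAI wrapper; swap in your provider's wrapper as needed.
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")  # any vision-capable model works, e.g. Gemini-2.5-Flash via its own wrapper

agent = DroidAgent(
    goal="Check my calendar and reschedule tomorrow's 9 AM meeting to 11 AM",  # hypothetical goal
    llm=llm,
    tools=tools,        # assumed to be created as in the surrounding examples
    reasoning=True,
    reflection=True,    # reflection analyzes screenshots, hence the vision-capable LLM
)
```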
By default, DroidAgent operates entirely without vision by leveraging Android’s Accessibility API to extract the UI hierarchy as XML. This approach is efficient and works well for most automation tasks. However, enabling vision allows the agent to take screenshots and visually analyze the device screen. To enable vision capabilities, set `vision=True` in your agent configuration:
```python
agent = DroidAgent(
    goal="Open up TikTok and describe the content of the video you are seeing",
    llm=llm,
    tools=tools,
    vision=True
)
```
Vision can be beneficial in specific scenarios:
- Content-heavy applications: When apps contain complex visual elements, images, or layouts that aren’t fully captured by the XML hierarchy
- Visual verification: For tasks requiring confirmation of visual elements or layouts
- Enhanced context understanding: When UI structure alone doesn’t provide sufficient information for decision-making