When Independence Meets Uncertainty: My Journey with AI-Powered Vision
A blind user's candid assessment of the promises and pitfalls of current AI accessibility tools
The thermostat in my apartment has a sleek digital interface that's completely inaccessible to me apart from its physical buttons, something we see less and less of these days. For months, I'd relied on Be My AI: taking a photo, waiting for the analysis, asking follow-up questions through text prompts and sometimes additional uploads. The process worked, but it felt cumbersome. So when Gemini Live's camera features became available to all users this spring and I finally got my hands on a Google Pixel Pro Fold, I decided to test it against my go-to solution.
I opened Gemini Live, switched to video mode, pointed my phone at the control panel, and simply asked it to help me lower the temperature. The response was immediate and natural: "I can see your thermostat is currently set to 78 degrees. To lower it, you'll want to press the down arrow on the right side of the display." I moved my finger toward the right, asked for clarification, pressed the button it had indicated, and then asked what the thermostat was set to now. "It looks like the display now reads 74 degrees. Is there anything else I can assist you with?"
Within seconds, we were much cooler, and I felt like I'd glimpsed the future of accessible technology.
Two weeks later, that same optimism crashed into reality when I spent two frustrating hours trying to operate an unfamiliar dryer at a friend's house, with Gemini confidently giving me incorrect instructions twice before I gave up and moved on to other solutions.
This is the current state of AI-powered accessibility: revolutionary potential constrained by fundamental reliability issues that create a peculiar form of technological anxiety. We're no longer limited by what these tools can't do; we're limited by whether we can trust what they tell us they can do.
The Evolution of Visual Assistance
The journey to this moment began long before AI. For decades, blind and low-vision people have navigated a world designed for sight through various forms of assistance. Initially, this meant relying on family, friends, or the kindness of strangers to read mail, identify objects, or navigate unfamiliar spaces. The emotional cost of this dependency often went unrecognized: the hesitation before asking someone to read personal medical results, the awkwardness of asking for shopping assistance at the grocery store, the simple desire to maintain privacy over everyday tasks.
The introduction of services like Be My Eyes and Aira offered a revolutionary alternative: professional assistance available 24/7 through smartphone connectivity. Be My Eyes connects users with volunteer sighted helpers through live video calls, while Aira provides trained agents for more complex navigation tasks. Both services solved the availability problem; help was just a phone call away. But they also introduced new challenges: the social anxiety of interacting with strangers, the vulnerability of sharing potentially personal visual information, and in Aira's case, the substantial monthly cost.
When AI-powered tools emerged, they promised to eliminate the human element entirely. No more explaining your situation to a volunteer, no more worrying about inconveniencing someone, and all with lower subscription fees or, in some cases, none at all. Just point, ask, and receive instant information about the visual world around you.
This promise has largely been fulfilled, with significant caveats.
The Current Landscape
Today's AI accessibility tools fall into several distinct categories, each with different interaction models and use cases:
Static Image Analysis Tools like Be My AI and Seeing AI require users to capture photos and receive text or audio descriptions. Be My AI, powered by GPT-4 Vision, excels at detailed image analysis and allows text-based follow-up questions about uploaded images. However, examining different aspects of a scene (like adjusting a thermostat and then confirming the change) still requires uploading separate images for each visual state, making the process feel cumbersome compared to continuous video analysis. Seeing AI offers specialized "channels" for different tasks (documents, people, products) but operates on the same capture-and-analyze model.
Continuous Scanning Tools like Google Lookout (Android only) and certain Envision AI features provide ongoing analysis of the camera feed, offering real-time audio feedback about objects and text in view. These work well for general environmental awareness but can become overwhelming in busy visual environments.
Conversational Video Analysis represents the newest category, with Gemini Live leading the way. This approach combines continuous video input with natural language interaction, allowing users to have flowing conversations about what the camera sees. While processing typically involves 1-3 second delays for cloud-based analysis, the interface feels remarkably natural in practice, similar to describing something to a sighted friend, though the technology behind it is still developing.
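To make that interaction model concrete, here is a minimal sketch of what a conversational video-analysis loop looks like from the client's side. The functions, frame source, and timing values are hypothetical stand-ins rather than Gemini's actual API; the point is only the shape of the loop: stream whatever the camera currently sees, ask questions in natural language, and tolerate a short cloud round trip for each answer.

```python
import asyncio
import time

# Hypothetical stand-ins for a real client SDK and camera feed.
# They illustrate the interaction pattern, not any vendor's API.

async def capture_frame() -> bytes:
    """Pretend to grab the most recent camera frame."""
    await asyncio.sleep(0.03)          # roughly a 30 fps camera
    return b"<jpeg bytes>"

async def ask_about_frame(frame: bytes, question: str) -> str:
    """Pretend to send the current frame plus a question to a cloud model."""
    await asyncio.sleep(1.5)           # typical 1-3 second cloud round trip
    return f"(model answer about: {question!r})"

async def conversation() -> None:
    # Follow-up questions run against whatever the camera currently sees,
    # rather than requiring a fresh photo upload for each visual state.
    questions = [
        "What is this thermostat set to?",
        "Which button lowers the temperature?",
        "What does the display read now?",
    ]
    for q in questions:
        frame = await capture_frame()      # always use the latest view
        start = time.monotonic()
        answer = await ask_about_frame(frame, q)
        print(f"{q}\n  -> {answer}  ({time.monotonic() - start:.1f}s)")

asyncio.run(conversation())
```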
Each approach has trade-offs in accuracy, convenience, and use case suitability. Understanding these differences is crucial for setting appropriate expectations and choosing the right tool for specific tasks.
The Trust Calculation
Working with AI accessibility tools requires a complex psychological recalibration that goes beyond simply learning new technology. With human volunteers, the reliability equation was straightforward: humans might make mistakes, but they express uncertainty, ask for clarification, and generally don't fabricate information. The vulnerability was social (sharing personal visual information with strangers), but the information itself was trustworthy.
AI tools invert this equation. The social barrier disappears, since no human interaction is required, but the reliability of the information becomes uncertain. These systems can confidently provide incorrect information about critical details, and they do so with the same authoritative tone they use for accurate responses.
This creates what I've come to think of as "verification culture" among blind AI users. Almost every experienced user I know has developed strategies for double-checking AI responses. I routinely run the same question through multiple AI tools or multiple sessions with the same tool, looking for consistency. For anything important (medicine labels, financial documents, navigation in unfamiliar places), I still default to human verification when possible.
The result is a peculiar form of semi-independence. These tools dramatically expand what I can explore and understand independently, but they require constant mental calibration about when to trust the information they provide.
Use Case Hierarchy
Experience with these tools reveals a clear hierarchy of appropriate applications, though individual effectiveness varies significantly based on personal context, environment, and the specific visual challenge at hand.
Low-Risk Applications where AI tools generally excel include:
Casual exploration: Walking around your house or neighborhood and asking about interesting objects, architectural details, or scenery changes
Restaurant menus and shopping: Getting information about options and prices when accuracy isn't critical to safety or health
Social media: Understanding memes, photos friends share, or visual content where mistakes are merely annoying rather than consequential
Reading printed materials: Books, magazines, non-critical documents where you can verify confusing information through context
Medium-Risk Applications requiring personal judgment and often verification include:
Navigation assistance: Using AI to identify street signs, building entrances, or public transportation information while combining this input with traditional mobility skills and environmental awareness
Technical interfaces: Computer screens, appliance controls, or device settings where mistakes cause inconvenience rather than danger
Product identification: Distinguishing between similar items when the consequences of errors are manageable
High-Risk Applications where human intuition and verification remain essential include:
Medical information: Reading medication labels, dosage instructions, or medical test results where errors could have serious health consequences
Financial documents: Banking information, bills, or contracts where mistakes could cause financial harm
Safety-critical tasks: Street crossing, operating heavy machinery, using dangerous equipment, or navigating hazardous environments
Emergency situations: Any scenario where quick, accurate information is essential for safety
The key insight is that these tools work best as information-gathering assistants that enhance human decision-making rather than autonomous agents that replace human judgment. I might use AI to gather details about an environment (identifying signs, landmarks, or locations) while combining this input with my own navigational skills and environmental awareness. Conversely, while I might use it to help navigate a low-traffic parking lot in conjunction with my mobility training, I wouldn't rely on it for crossing busy streets, where safety depends on real-time accuracy I can't always independently verify.
The Technical Reality
The hallucination problem in AI accessibility tools isn't a simple bug that future updates will fix; it's an inherent characteristic of how these systems work. Understanding why these failures occur is the first step toward using the tools safely.
Current AI models are essentially sophisticated pattern-matching systems trained on massive datasets of text and images. They work by breaking down information into mathematical representations called "embeddings": numerical patterns that capture relationships between different concepts. In an ideal system, the embedding for "red ball" would be identical whether the model reads the text "red ball" or sees an image of one.
However, achieving perfect alignment between different types of data (text, images, audio) is extraordinarily difficult. When this alignment fails, the model becomes confused about what it's actually perceiving. Since text data typically dominates training datasets, the model's vast textual knowledge can "overshadow" contradictory visual information.
This explains why AI tools might confidently misread a medicine bottle. The model recognizes visual patterns suggesting "this looks like a medicine label" and knows from its text training that medicine labels contain specific types of information. Rather than carefully analyzing the actual text in the image, it generates statistically plausible medical information based on what medicine labels typically say. The result is confident fabrication: not deliberate deception, but statistical prediction masquerading as visual analysis.
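To make the mechanism concrete, here is a toy numerical sketch of that failure mode. The vectors, labels, and fusion weights are invented purely for illustration; real multimodal models use thousands of learned dimensions rather than three hand-picked numbers. The point is simply that when the visual signal is weighted weakly (poor alignment, blur, glare), a strong text-derived prior can win, and the model "reads" what labels usually say instead of what this label says.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-D "embeddings" for two candidate readings of a blurry label.
candidates = {
    "Take 1 tablet daily":        np.array([1.0, 0.2, 0.0]),
    "Take 2 tablets twice daily": np.array([0.2, 1.0, 0.1]),
}

# Weak, noisy signal from the image encoder; it genuinely points
# toward "Take 1 tablet daily".
visual_evidence = np.array([0.9, 0.3, 0.1])

# What the language prior "expects" a label like this to say,
# based on the phrasing most common in its text training data.
text_prior = np.array([0.1, 1.0, 0.2])

for vision_weight in (0.8, 0.3):  # well-aligned vs. poorly aligned
    fused = vision_weight * visual_evidence + (1 - vision_weight) * text_prior
    scores = {label: cosine(fused, emb) for label, emb in candidates.items()}
    best = max(scores, key=scores.get)
    print(f"vision weight {vision_weight}: model reads {best!r}")
```

With the visual evidence weighted heavily, the correct reading wins; weight it lightly and the statistically "typical" dosage wins instead, delivered with exactly the same confidence.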
The conversational nature of tools like Gemini Live compounds this problem. Because these systems are trained to be helpful and engaging, they often make promises they can't keep, offering to "let you know when the bus arrives" without any mechanism for ongoing monitoring, or claiming to remember previous conversations when each interaction is actually independent.
This embedding misalignment also explains spatial reasoning failures. When an AI confidently tells you the power button is "on the left side" when it's actually on the right, it might be because the model's textual knowledge of "where power buttons are usually located" has overridden the specific visual evidence from your camera feed.
Understanding these limitations doesn't diminish the value of AI accessibility tools, but it helps explain why they require careful, situational use rather than blanket trust. The technology is genuinely impressive at pattern recognition, but it fundamentally lacks the contextual understanding and uncertainty awareness that humans bring to visual interpretation.
Conversation vs. Command
The shift from command-based to conversational AI interfaces represents perhaps the most significant usability improvement in accessible technology in recent years. Traditional accessibility tools often required learning specific commands or navigation patterns: double-tap and hold here to get image descriptions, double-tap there to read text, swipe in a particular direction to change modes.
Gemini Live's conversational approach eliminates this cognitive overhead. Instead of remembering which mode or channel to use, you simply describe what you want to know: "What's the temperature setting on this thermostat?" or "Are there any cars coming from the left?" The system interprets your intent and provides relevant information without requiring you to categorize your request first.
This natural language interaction also enables iterative refinement that wasn't possible with earlier tools. If the initial response isn't quite what you need, you can follow up immediately: "Actually, I meant the lower display, not the upper one" or "Can you guide my finger to the right button?" The conversation flows naturally rather than requiring you to start over with a new image or command.
However, this conversational ease can mask the underlying technical limitations. The naturalness of the interaction encourages trust and dependency that the technology can't always justify. Users report feeling like they're talking to a knowledgeable human assistant, which makes the occasional confident error all the more jarring and potentially dangerous.
The Promise and Perils of Real-Time Analysis
The move from static image analysis to real-time video processing represents both the greatest advancement and the most significant risk in current AI accessibility tools. Previous generations required users to capture specific images for analysis, a process that was cumbersome but encouraged deliberate, careful use of the technology.
Real-time systems like Gemini Live encourage more spontaneous, exploratory use. You can walk through an unfamiliar building while asking questions about your surroundings, or explore a new neighborhood while getting continuous descriptions of interesting architecture or local businesses. This spontaneity can dramatically expand independence and confidence in new environments.
However, real-time analysis also creates new opportunities for misunderstanding and error. The system might confidently describe something at the edge of the camera's view that you assume is directly in front of you, or provide navigation instructions based on visual elements that are partially obscured or ambiguous.
The speed of real-time interaction also doesn't allow for the careful verification that static image analysis encouraged. With Be My AI, users typically examined the AI's description carefully before acting on it. With conversational real-time tools, the natural flow of interaction can lead to acting on information before fully evaluating its reliability.
Looking Forward
The ultimate promise of AI-powered vision assistance lies not in smartphone apps but in wearable devices that provide continuous, contextual information about the surrounding environment. Companies like Envision, Meta, and others are developing smart glasses that combine computer vision, AI analysis, and discreet audio feedback to create what could become genuine vision prosthetics.
The appeal of this vision is obvious: hands-free operation, continuous environmental awareness, and the ability to receive information without the conspicuous action of pointing a smartphone at objects or people. Early demonstrations suggest these devices could provide real-time information about approaching vehicles, identify familiar faces in social settings, or read street signs and building numbers during navigation.
However, the current limitations of AI accuracy make this vision simultaneously exciting and concerning. If AI occasionally provides incorrect information when you're deliberately asking it to analyze something specific, how much more problematic might continuous, ambient AI assistance become? The very convenience that makes smart glasses appealing (automatic, ongoing analysis) also removes the deliberate verification step that current tools require.
The development of reliable AI-powered smart glasses will likely require advances not just in computer vision and processing power, but in uncertainty quantification: the ability for AI systems to accurately assess and communicate their confidence levels. A smart glasses system that could say "I think that's a stop sign, but I'm only 60% confident due to the lighting conditions" would be vastly more useful than one that confidently reports incorrect information.
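As a rough illustration of what that behavior might look like, the sketch below converts raw recognition scores into probabilities, answers confidently above a threshold, and hedges (and asks for verification) below it. The scores, labels, and 75% cutoff are assumptions chosen for the example; genuine uncertainty quantification involves calibration and far more than a softmax and a threshold.

```python
import math

def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Convert raw recognition scores into probabilities that sum to 1."""
    exps = {label: math.exp(s) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def spoken_report(scores: dict[str, float], threshold: float = 0.75) -> str:
    """Answer confidently above the threshold, hedge and defer below it."""
    probs = softmax(scores)
    label, p = max(probs.items(), key=lambda item: item[1])
    if p >= threshold:
        return f"That's a {label} ({p:.0%} confident)."
    return (f"I think that might be a {label}, but I'm only {p:.0%} confident "
            f"given what I can see. Please verify before acting on it.")

# Invented recognition scores: first a sign seen in poor lighting,
# then the same sign seen clearly.
print(spoken_report({"stop sign": 2.1, "yield sign": 1.6, "billboard": 0.4}))
print(spoken_report({"stop sign": 4.0, "yield sign": 0.5, "billboard": 0.1}))
```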
Building Verification Strategies
Given the current state of AI accessibility tools, successful users develop sophisticated verification strategies that balance the convenience of AI assistance with the need for accurate information. However, there's often a significant gap between recommended verification practices and what users actually do in real-world situations.
Theoretical Best Practices include multiple tool verification, context-based evaluation, and systematic human backup systems. Organizations like the Royal National Institute of Blind People recommend using multiple AI tools to analyze the same visual information, looking for consistency across responses. If Be My AI, Seeing AI, and Gemini Live all identify the same medication and dosage, the consensus theoretically provides confidence.
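In day-to-day use, that cross-checking can be as simple as normalizing each tool's answer and only trusting a result that nearly all of them agree on. The tool names and responses below are hypothetical placeholders, and the three-vote requirement is an arbitrary choice for the example; the sketch shows the consensus logic, not any organization's actual workflow.

```python
import re
from collections import Counter

def normalize(answer: str) -> str:
    """Collapse case, spacing, and punctuation so trivially different
    phrasings of the same reading compare as equal."""
    return re.sub(r"[^a-z0-9]", "", answer.lower())

def consensus(answers: dict[str, str], minimum_agreement: int = 3) -> str:
    """answers maps tool name -> what that tool said it saw."""
    counts = Counter(normalize(a) for a in answers.values())
    top, votes = counts.most_common(1)[0]
    if votes >= minimum_agreement:
        # Report one of the original (un-normalized) matching answers.
        matching = next(a for a in answers.values() if normalize(a) == top)
        return f"Consensus ({votes}/{len(answers)} tools agree): {matching}"
    return ("No consensus: the tools disagree. Treat the reading as "
            "unverified and confirm with a human before acting on it.")

# Hypothetical responses about the same medication label.
print(consensus({
    "Be My AI":    "Ibuprofen 200 mg, take 1 tablet every 6 hours",
    "Seeing AI":   "Ibuprofen 200mg - take 1 tablet every 6 hours",
    "Gemini Live": "Ibuprofen 200 mg: take 1 tablet every 6 hours",
}))
print(consensus({
    "Be My AI":    "Lisinopril 10 mg",
    "Seeing AI":   "Lisinopril 40 mg",
    "Gemini Live": "Lisinopril 10 mg",
}))
```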
Real-World Practice often looks quite different. Many users rely on informal peer verification. The reality is that many blind users cannot independently verify AI outputs, creating concerning dependency patterns that formal guidelines don't address. When you're alone with an unfamiliar appliance, theoretical verification strategies become less practical than hoped.
Progressive Trust Building involves starting with low-stakes applications where you can easily verify AI accuracy, gradually expanding use as you develop intuition about when and how the tools are likely to make mistakes. This approach helps users calibrate their trust appropriately for different types of visual analysis, though it requires extended experimentation that not all users have the time, willingness, or opportunity to pursue.
Human Backup Systems remain essential for critical information, but access to reliable human verification varies significantly based on social networks, economic resources, and geographic location. Successful AI users maintain access to Be My Eyes, Aira, or trusted friends and family for verification of important information, treating AI tools as first-line assistance rather than authoritative sources.
The challenge is that the same factors that make AI accessibility tools appealing (privacy, independence, availability) also make verification difficult. The goal isn't perfect verification but developing practical judgment about when AI assistance is sufficient and when additional confirmation is necessary.
Technology and Human Agency
The emergence of AI-powered accessibility tools raises broader questions about the relationship between technological assistance and human agency. These tools offer unprecedented independence in accessing visual information, but they also create new forms of dependency on systems that we don't fully understand or control.
The ideal relationship with these tools probably resembles the approach many people take with GPS navigation: using the technology for convenience and exploration while maintaining the underlying skills and awareness needed to function when the technology fails or provides incorrect information. For blind and low-vision users, this means continuing to develop traditional mobility and problem-solving skills while leveraging AI tools to expand possibilities and reduce barriers.
The goal isn't to achieve perfect AI-powered vision replacement; current technology is nowhere near that capability. Instead, the goal is to thoughtfully integrate AI assistance into a broader toolkit of strategies for navigating a visual world, understanding both the remarkable capabilities and significant limitations of current tools.
In Conclusion
My experience with Gemini Live and other AI accessibility tools over the past year has taught me that the future of AI-powered assistance lies not in replacing human judgment but in augmenting human capabilities while respecting human agency. These tools work best when they expand what we can explore and understand independently, rather than when we rely on them for critical decisions without verification.
The thermostat success story that opened this article represents AI accessibility at its best: providing immediate, convenient access to visual information in a low-stakes situation where accuracy can be easily verified. The dryer disaster represents AI at its most problematic: confidently providing incorrect information in an unfamiliar situation where verification was difficult.
Understanding the difference between these scenarios, and developing the judgment to navigate them appropriately, is perhaps the most important skill for anyone using AI accessibility tools. These technologies offer genuine benefits and genuine risks, often simultaneously.
As the technology continues to improve, the hope is that AI systems will develop better uncertainty quantification: the ability to express doubt when they're not confident rather than confidently providing incorrect information. Until that happens, the most successful approach involves treating AI as a powerful but fallible assistant that can dramatically expand your capabilities while respecting the need for human verification in critical situations.
The future I'm looking forward to isn't one where AI systems replace human judgment, but one where they provide reliable, accessible information that enables better human decision-making. We're not there yet, but the progress over the past two years suggests we're moving in the right direction.
For now, I'll continue to use these tools extensively while maintaining healthy skepticism about their limitations. The independence they provide, even with caveats, represents a meaningful improvement in access to visual information over what we had even two years ago. The challenge is learning to use them wisely, recognizing both their remarkable capabilities and their significant limitations.
Trust, once lost, takes time to rebuild. But with careful use, clear understanding of limitations, and appropriate verification strategies, AI-powered accessibility tools can become valuable partners in navigating an increasingly visual world.
Sources
Be My Eyes. (2025, March 27). Introducing Be My AI (formerly Virtual Volunteer) for People who are Blind or Have Low Vision, Powered by OpenAI's GPT-4. Be My Eyes News. Retrieved from https://www.bemyeyes.com/news/introducing-be-my-ai-formerly-virtual-volunteer-for-people-who-are-blind-or-have-low-vision-powered-by-openais-gpt-4/
Primary source documenting Be My AI's technical foundation using GPT-4 Vision and its implementation as a static image analysis tool requiring separate uploads for follow-up questions.
Google. (2024, December 23). Gemini 2.0: Level Up Your Apps with Real-Time Multimodal Interactions. Google Developers Blog. Retrieved from https://developers.googleblog.com/en/gemini-2-0-level-up-your-apps-with-real-time-multimodal-interactions/
Technical documentation explaining Gemini Live's bidirectional streaming architecture and real-time video processing capabilities through WebSocket connections.
Google. (2025, May 20). Gemini App: 7 updates from Google I/O 2025. Google Blog. Retrieved from https://blog.google/products/gemini/gemini-app-updates-io-2025/
Official announcement confirming Gemini Live's camera and screen sharing features became available to all users for free, supporting the article's premise about widespread accessibility.
Google. (n.d.). Disability Innovation in the Workplace and Beyond. Google Belonging. Retrieved from https://belonging.google/in-products/disability-innovation/
Corporate documentation of Google Lookout's capabilities and Android-platform specificity, confirming the tool's continuous scanning functionality and platform limitations.
Guide Dogs UK. (n.d.). Apps to Help People with Vision Impairment. Guide Dogs Information and Advice. Retrieved from https://www.guidedogs.org.uk/getting-support/information-and-advice/how-can-technology-help-me/apps/
Professional accessibility organization's comparative evaluation of AI tools, providing context for the article's analysis of different approaches and their relative strengths.
Microsoft. (n.d.). Innovation and AI for Accessibility. Microsoft Accessibility. Retrieved from https://www.microsoft.com/en-us/accessibility/innovation
Official documentation of Seeing AI's capabilities as a multi-channel tool designed for the low vision community, confirming its specialized approach to different visual analysis tasks.
National Center for Biotechnology Information. (2024). Digital accessibility in the era of artificial intelligence: Bibliometric analysis and systematic review. PMC. Retrieved from https://pmc.ncbi.nlm.nih.gov/articles/PMC10905618/
Academic research providing context for AI reliability challenges in accessibility applications and the importance of accuracy in assistive technology contexts.
Perkins School for the Blind. (2023, December 26). Using the Envision app with low vision. Perkins Resources. Retrieved from https://www.perkins.org/resource/using-the-envision-app-with-low-vision/
Detailed user evaluation confirming Envision AI's cross-platform availability, multi-language support, and integration with optional smart glasses hardware.
Accessible Android. (2025, May 15). Gemini Live Camera Stream and Screen Sharing Impressions: Is Gemini Ready to Be the AI Assistant for the Blind? Accessible Android. Retrieved from https://accessibleandroid.com/gemini-live-camera-stream-and-screen-sharing-impressions-is-gemini-ready-to-be-the-ai-assistant-for-the-blind/
Independent user testing documentation providing specific examples of AI hallucination issues including medicine misidentification, brand name errors, and confident delivery of incorrect information that supports the article's reliability concerns.
Comments
Thank you for this piece, Kaylie.
You've managed to navigate a terrain that most commentary on assistive tech tends to flatten or evade entirely. The neutrality you maintain isn't a lack of position; it's a form of rhetorical clarity, an insistence that nuance can still speak truth, even (or especially) when the systems we're assessing are built on asymmetries of power and access.
I want to speak briefly from my own context, which exists outside the dominant geographies where these tools are built, tested, and mythologized.
For example, conversational AI (Gemini in particular) reveals a striking unfamiliarity with cultural referents, political conditions, and linguistic nuance that originate beyond the Global North. The bias isn't just embedded in data sets. It appears in the assumptions about what counts as intelligible input, what kinds of knowledge matter, and which users are worth designing for.
You noted its failures in pattern matching and evidence gathering. I'd go further: it performs a kind of epistemic triage. It sorts what it understands from what it dismisses, and in that sorting, entire histories, dialects, and lived experiences are misclassified as "errors" or "hallucinations."
Apps like Be My Eyes offer a compelling use case. However, even here, locality becomes a determining factor. I've noticed that user pairings tend to cluster regionally. While this may increase relevance, it also reveals how global inclusion is often restricted by infrastructure and linguistic segmentation. It's not necessarily a flaw in the design, but it does reinforce how "global" platforms often become regionally siloed in practice.
Here's a more direct example: if I didn't speak English fluently, I wouldn't have been able to access something as seemingly inclusive as the Microsoft Disability Answer Desk. That's not simply a technological limitation. It reflects a systemic failure in multilingual accessibility, one that conversational AI currently reproduces rather than resolves.
In the case of Gemini, it often renders other languages (Arabic, as an example) nearly unintelligible unless explicitly prompted to use a specific variety: Egyptian colloquial Arabic, Gulf Arabic, and so on.
Even then, the result is inconsistent at best. What emerges is not a translation, but a disjointed linguistic approximation that occasionally slips into the absurd. It's less assistance and more mimicry of language diversity, without genuine understanding.
Regarding document remediation, we are still far from where we need to be. Optical character recognition is no longer the main hurdle. The greater challenge is structuring inaccessible content into usable, navigable formats, complete with headings, sections, and semantic landmarks. To date, I haven't found an AI model (ChatGPT, DeepSeek, or otherwise) that can perform this task reliably without an exhausting amount of manual prompting. By the time the output becomes usable, I could have already completed the task using Kurzweil or OpenBook.
These arenât fringe critiques. They are central to the question of what it means for assistive technology to assist. If it doesn't work equitably, then it reproduces the very exclusions it claims to address.
Your writing doesn't fall into this trap. Instead, it opens space for the deeper conversation that is so often missing, where access is not just a matter of interface, but of justice.
I've been commenting a lot (perhaps too frequently), but only because your posts make room for these thoughts. If my engagement ever feels overwhelming, I'm happy to take a step back. But if you're ever up for a deeper conversation, or if you need collaboration or support in any way, I'd be glad to connect.
What an excellent reply. Thank you so very much for your kind words, and believe me, you are not a bother in the slightest. I've actually wanted to delve into research around linguistics and AI, but I'm unfortunately a plebeian who only knows English as a native language, and the languages I've picked up here and there are hardly usable. So I don't feel like I'd be the right person to properly dive into that particular topic. That said, you raise an interesting point. These models are being developed, with the exception of DeepSeek, in English-speaking territories. Thus, it stands to reason that although their training data covers other languages, they aren't being intrinsically trained on those languages well enough to handle real, actual conversation. You're not the first to tell me about the linguistic experience in Gemini or in other models. I would love to see a study done on this. Please feel free to reach out on Facebook and request me as a friend or message me; I would be delighted to talk with you more. We seem to share a lot of ideological agreement. I will also say that, from the position of writing these articles, the one thing I strive to avoid is either oversimplification or the trap of playing up the hype of a product. I don't want to be one of those people who comes along and says AI is going to cure everything, because if you've read some of my other research, AI isn't going to fix all the systemic issues we have before us; in fact, it perpetuates them in some areas. I don't want to be one of those pro-AI types who refuses to see the authentic and true challenges in front of us, especially in the disability field, where there is already so much careless talk going on that completely undermines the point of inclusion and of making sure that everybody's needs are met. AI offers us a huge tool to help bridge a lot of accessibility needs, but the technology is still in its infancy, despite how far we've technically come. I try to remember that when I write things like this, because it's honestly important to repeat it from time to time. These machines are not infallible, even if they can generate good content. They're not right just because they're trained on so much data. And they certainly aren't evolved enough to demonstrate the ability to intrinsically pick up on a user's needs without being explicitly directed. Until we get to that point, I'm going to maintain my realistic approach to writing these things. Thank you again so much for your kind words, and if this message comes out disjointed, just know it's because I'm using dictation right now and not typing. LOL. As always, it's a pleasure engaging with you.