Voice Is Eating the Prompt: How Ordinary People Will Talk to AI Next

For the last two years, the default image of using AI has looked something like this:

someone sitting in front of a text box, carefully writing a prompt as if they were briefing a very literal genie.

That image is not wrong. In fact, it defined an entire phase of AI product behavior. But lately I have felt more and more strongly that this interface is starting to loosen. Prompting is not suddenly useless. It is simply moving into the background, while voice, screenshots, screen recordings, and raw documents are slowly becoming the more natural front door.

I am not saying that because one company made a flashy claim. I am saying it because today’s signals line up unusually well.

Sam Altman has said that voice models will change the way people interact with AI. Product Hunt’s No. 2 launch today, Velo 2.0, is doing something very concrete with that direction by turning your voice and your screen into a shareable video almost instantly. On the technical side, Hacker News is discussing GLM-5V-Turbo as a model aimed at multimodal agents, while GBrain is expanding toward multimodal embeddings, photo OCR, and EXIF extraction.

Taken together, the picture feels pretty clear to me.

The next dividing line in AI use will not be who can write the best prompts. It will be who can hand over intent more naturally.

Prompts are not disappearing, but they are a weak mainstream interface

I have never fully liked the idea that prompt engineering is the long-term human-computer interface.

It is true for a certain kind of power user. It is much less true for everyone else.

The reason is simple. Writing a good prompt still requires structured expression. You need to know what you want, break it down, provide context, and patiently refine it when the first answer is off.

For engineers, researchers, and some creators, that is manageable. For ordinary users, it is friction.

A lot of people do not lack intent. They just do not want to learn a ritual for talking to a machine.

That is why prompting has always felt to me like a transitional interface. It is powerful, but it is not naturally comfortable. It behaves a little like the command line. Skilled people love it. Most people would rather click, speak, drag, drop, and move on.

So the interfaces that will truly broaden AI adoption are less likely to come from ever more elaborate prompting tricks and more likely to come from lower-friction forms of input.

Why voice will eat part of prompting first

Voice wins not because it looks futuristic, but because it is cheaper.

Cheaper in what sense? It reduces the cost of typing, the cost of organizing language, and the cost of feeling like you need to fully think before you speak.

A lot of needs are not hard to express. They are just annoying to compress into neat text.

That becomes obvious in moments like these:

when you want to capture an idea quickly
when you want AI to summarize a meeting
when you want to explain what feels wrong on a page
when you want to replay why a workflow failed
when you want to dump a half-formed thought before it fades

In those moments, voice sits closer to natural human behavior than text does.

Text asks you to organize first and express second. Voice lets you think while speaking.

That difference matters more than it sounds. A surprising number of products do not grow because their capability jumps by a full level. They grow because the action cost drops by half a level.

When Sam Altman says voice models will change human-AI interaction, I do not think that is hype. The change is not only about benchmark performance. It is about whether a person feels willing to use the system one more time.

And for many products, life or death really does come down to that one more time.

Products like Velo 2.0 are really changing the expression layer

What I care about in Velo 2.0 is not only the headline feature, which is turning your voice and screen into video.

What matters more is the deeper shift underneath it.

Velo compresses what used to be a more fragmented process of recording, narrating, editing, and packaging into a much smoother action. You speak, you demonstrate, you record the screen, and the system helps turn that into content others can consume.

That sounds like a creator-tool upgrade, but I think it is actually rewriting something more basic.

More AI products are no longer asking users to translate their thoughts into prompts first. They are letting behavior itself become input.

You speak.

You move the cursor.

You switch windows.

You highlight something.

You throw in raw material.

Those used to be just operational traces. Now they are becoming interpretable input.

That is a big deal because ordinary users were never best at writing formal prompts. They were always better at simply doing things.

If a system can understand action, visuals, and tone, it becomes much friendlier to the mass market.

Multimodality is not a feature add-on. It is an interface shift

A lot of people still read multimodal updates as one more capability on the model checklist.

I think that understates what is happening.

The real significance of multimodality is not merely that a model can see images, hear audio, or pull text out of photos. It is that users can increasingly express intent in rougher, more natural ways.

The GLM-5V-Turbo discussion on Hacker News points directly toward multimodal agents. GBrain’s updates around photo OCR, EXIF extraction, and multimodal embeddings point in the same direction. The industry is preparing not just for flashy demos, but for systems that can absorb messier, more realistic forms of input.

And real-world input is messy by default.

It might be a screenshot.

It might be a spoken complaint.

It might be a screen recording.

It might be a few product photos.

It might be a PDF you never bothered to rename.

The companies that push AI into broader everyday use will not be the ones with the fanciest prompt templates. They will be the ones that can reliably catch all this messy material and still get useful work done.

How ordinary people will talk to AI next

If I had to make a direct bet, it would be this:

people will increasingly interact with AI the way they collaborate with an assistant who can see the scene, hear the context, and remember the task, rather than the way they currently write to a blank input box.

More concretely, I expect a few changes.

First, people will speak first and structure later

Today many users still feel like they need to mentally clean up a request before typing it.

Tomorrow the more common pattern will be to say the rough thought first and let AI organize it.

In other words, AI will increasingly accept the rough draft and produce the structured request itself.

Second, people will provide context before instructions

Instead of typing, “Analyze why this page converts poorly,” users will increasingly send a screen recording, a heatmap screenshot, and a voice note that says, “Look here, this is probably where users drop off.”

That is much closer to real communication.

Third, input will become mixed by default

Text, voice, images, documents, webpages, and live screen state will start blending together.

Users will not care whether that counts as “proper input.” They will only care whether the system can take all of it and still get the job done.

Fourth, prompting skill will become backend infrastructure

This is the part I find most interesting.

Prompting is not going away. It is just going to hide behind the product.

The strongest products will not require users to write good prompts. They will generate the prompts, structure, context, and tool-routing logic on the user’s behalf.

Put more bluntly, prompt engineering is not dying. It is becoming productized, automated, and increasingly invisible.

Ordinary users will not need to obsess over the perfect wording because the system will perform that translation for them.
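
To make that translation step a little more concrete, here is a minimal sketch in Python of what "the product writes the prompt for you" could look like. Everything in it is my own illustration, not how any product mentioned here actually works: the names RawInput, StructuredRequest, and build_request are hypothetical, and I am assuming that speech has already been transcribed and that text has already been extracted from screenshots and documents. The rule it uses (the first voice note becomes the goal, everything else becomes context) is deliberately naive.

```python
from dataclasses import dataclass, field


@dataclass
class RawInput:
    """One piece of messy user input (hypothetical shape, for illustration)."""
    kind: str     # e.g. "voice", "screenshot", "document"
    content: str  # transcript or extracted text, assumed to exist already


@dataclass
class StructuredRequest:
    """The tidy request the product builds so the user does not have to."""
    goal: str
    context: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        lines = [f"Goal: {self.goal}"]
        if self.context:
            lines.append("Context:")
            lines.extend(f"- {item}" for item in self.context)
        return "\n".join(lines)


def build_request(inputs: list[RawInput]) -> StructuredRequest:
    """Naive rule: first voice note is the goal, the rest is context."""
    goal = "Help with the attached material"  # fallback when nobody spoke
    goal_set = False
    context: list[str] = []
    for item in inputs:
        text = " ".join(item.content.split())  # collapse stray whitespace
        if item.kind == "voice" and not goal_set:
            goal = text
            goal_set = True
        else:
            context.append(f"[{item.kind}] {text}")
    return StructuredRequest(goal=goal, context=context)


# The user sends a screenshot and a rambling voice note, in that order.
request = build_request([
    RawInput("screenshot", "Checkout page, step 2 of 4"),
    RawInput("voice", "Look here,  this is probably where users drop off."),
])
print(request.to_prompt())
```

The user never sees the "Goal:" and "Context:" scaffolding. A real product would presumably replace the naive rule with a model call and add tool routing, but the shape is the same: rough input goes in, and a structured prompt comes out the other side.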

Why this matters right now

Because when a trend is just beginning to take shape, judgment is worth more than consensus.

Once everyone is repeating that voice is the future, the idea becomes less interesting. Right now the moment is better because you can still see both the promise and the awkwardness.

The promise is obvious. Input friction is genuinely falling.

The awkwardness is that many products have only added a microphone button without redesigning the interaction model underneath. They look more natural, but internally they still expect the user to speak as if they were dictating a polished prompt.

That is fake progress.

If voice is really going to eat prompting, it will not happen because products add one more icon to the UI. It will happen because systems become comfortable with messier, more conversational, less complete input.

They will need to ask follow-up questions, summarize, correct misunderstandings, and infer context from screenshots, tone, and artifacts. Otherwise voice interaction is just typing by mouth.

One last thought

If you are still studying prompting seriously today, that is completely fine. For heavy users it remains useful, and it is not disappearing anytime soon.

But if you ask me what the most common AI interface for ordinary people will be over the next three years, I would not bet on longer, denser, more formatting-sensitive prompts.

I would bet on a different picture.

You say one sentence out loud, snap a screenshot, record a short clip, throw in a few raw materials, and AI actually gets what you mean.

At that point the prompt still exists, but it is no longer center stage.

That direction, honestly, I like.

References

Sam Altman on voice models, as collected in the Follow Builders section of the 2026-05-06 tech radar report: https://x.com/sama/status/2051318922805436896
Product Hunt, Velo 2.0: “Instantly turn your voice and screen into shareable videos”: https://www.producthunt.com/posts/velo-2-0
arXiv, GLM-5V-Turbo: “Toward a Native Foundation Model for Multimodal Agents”: https://arxiv.org/abs/2604.26752
