Product
# What we learned from a million voice captures
Thoughts are shorter and messier than people expect. Here is what the data shows about how ADHD brains actually use voice.
After the first million voice captures, a pattern emerged that no usability test would have predicted. Captures are short and messy, and the messiness is the point.
> The median capture is under twelve seconds. People do not speak in tasks. They speak in worries, obligations, and half-plans. A capture tool that requires complete sentences is a capture tool nobody uses.
## What the data actually says
The median voice capture is 11.4 seconds. The 75th percentile is 18 seconds. The 90th percentile is 31 seconds. Fewer than two percent of captures exceed sixty seconds. People do not record monologues. They record fragments.
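For readers who want to run the same kind of summary on their own recordings, here is a minimal sketch of how such figures can be computed. It uses the nearest-rank percentile method, which is an assumption on our part for illustration; the function names and the sample durations are hypothetical and are not real capture data.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("no durations to summarize")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(durations_s):
    """Summarize capture durations (in seconds) the way the paragraph above reports them."""
    return {
        "p50": percentile(durations_s, 50),
        "p75": percentile(durations_s, 75),
        "p90": percentile(durations_s, 90),
        "share_over_60s": sum(d > 60 for d in durations_s) / len(durations_s),
    }


# Made-up durations for illustration only, not taken from the dataset.
sample = [4.0, 7.5, 9.0, 11.0, 12.5, 16.0, 19.0, 24.0, 33.0, 75.0]
summary = summarize(sample)
```

Other percentile definitions (for example, linear interpolation) give slightly different numbers on small samples, so the method is worth stating whenever figures like these are published.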
The fragments fall into a small number of patterns. The most common (about 38% of captures) is what we call a half-plan: "tomorrow I need to call the dentist about the thing." There is a task in there, but it is wrapped in context the user has not yet sorted. The second most common (about 23%) is a worry capture: "I think I forgot to send the proposal to Marek." There may or may not be a task; the user is asking themselves a question, externalizing a fear so they can stop carrying it. The third most common (about 17%) is what we call an obligation reminder: "remind me to bring the charger." Short, imperative, and structurally complete. The remaining 22% or so are a long tail of edge cases: recipe ideas, gift lists, half-articulated decisions about a relationship, fragments of dreams.
The architectural implication is clear. A capture system optimized for the median user must handle ambiguity natively. It cannot assume each capture maps to one task with one due date. It must be able to say "I think this is a half-plan; here are the two tasks I extracted; you can confirm or fix" without making the user feel interrupted.
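One way to picture "handling ambiguity natively" is a data model where a capture holds a guessed pattern and zero or more unconfirmed candidate tasks, rather than exactly one task with one due date. The sketch below is illustrative only: `Pattern`, `CandidateTask`, `ParsedCapture`, and `review_prompt` are hypothetical names, not KeptMind's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Pattern(Enum):
    HALF_PLAN = "half-plan"
    WORRY = "worry capture"
    OBLIGATION = "obligation reminder"
    OTHER = "note"


@dataclass
class CandidateTask:
    title: str
    due_hint: Optional[str] = None  # raw phrase such as "tomorrow"; None if no date was spoken
    confirmed: bool = False         # stays False until the user confirms or fixes it


@dataclass
class ParsedCapture:
    transcript: str
    pattern: Pattern
    candidates: List[CandidateTask] = field(default_factory=list)

    def review_prompt(self) -> str:
        """Build the low-pressure confirmation message shown to the user."""
        if not self.candidates:
            return f"I saved this as a {self.pattern.value}; no task extracted yet."
        count = len(self.candidates)
        noun = "task" if count == 1 else "tasks"
        titles = ", ".join(t.title for t in self.candidates)
        return (f"I think this is a {self.pattern.value}; I extracted {count} {noun}: "
                f"{titles}. You can confirm or fix.")


capture = ParsedCapture(
    transcript="tomorrow I need to call the dentist about the thing",
    pattern=Pattern.HALF_PLAN,
    candidates=[CandidateTask("Call the dentist", due_hint="tomorrow")],
)
prompt = capture.review_prompt()
```

The design choice worth noticing is that a capture with no candidates is still a valid object, so a worry or a dream fragment never has to be forced into a task.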
What surprised us most was the temporal clustering. ADHD brains do not capture evenly throughout the day. Roughly 40% of all captures happen in two narrow windows: the first fifteen minutes after waking and the last thirty minutes before sleep. These are the moments when the executive system is either ramping up or winding down, and unprocessed thoughts surface with urgency. The morning cluster tends toward obligation reminders — things the brain held overnight and needs to externalize before the day begins. The evening cluster skews toward worry captures — loose threads the brain cannot release without somewhere to put them. Designing for these two windows means optimizing for speed and low friction above all else, because the user is either half-asleep or trying to fall asleep.
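A rough sketch of how the share of captures in these two windows could be measured is shown below. It assumes a known wake time and sleep time for the day in question, which is a simplification; `window_of` and `share_in_windows` are hypothetical helper names.

```python
from datetime import datetime, timedelta

MORNING_WINDOW = timedelta(minutes=15)  # first fifteen minutes after waking
EVENING_WINDOW = timedelta(minutes=30)  # last thirty minutes before sleep


def window_of(capture_time, wake_time, sleep_time):
    """Label a capture as 'morning', 'evening', or 'other' for one day."""
    if wake_time <= capture_time <= wake_time + MORNING_WINDOW:
        return "morning"
    if sleep_time - EVENING_WINDOW <= capture_time <= sleep_time:
        return "evening"
    return "other"


def share_in_windows(capture_times, wake_time, sleep_time):
    """Fraction of captures that fall inside either window."""
    if not capture_times:
        return 0.0
    hits = sum(window_of(t, wake_time, sleep_time) != "other" for t in capture_times)
    return hits / len(capture_times)


# Illustrative day with made-up timestamps.
wake = datetime(2024, 1, 2, 7, 0)
sleep = datetime(2024, 1, 2, 23, 0)
day = [
    datetime(2024, 1, 2, 7, 5),
    datetime(2024, 1, 2, 7, 20),
    datetime(2024, 1, 2, 12, 0),
    datetime(2024, 1, 2, 15, 0),
    datetime(2024, 1, 2, 22, 45),
]
share = share_in_windows(day, wake, sleep)
```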
We also found that repeat captures are far more common than expected. About 12% of all captures are semantically identical to something the user recorded in the previous seven days. The same worry, the same half-plan, the same obligation — captured again because the user forgot they already captured it, or because the previous capture did not resolve into action. This is not a failure of the user; it is a signal that the loop between capture and action was not closed tightly enough. Our deduplication layer now flags these gently: "You captured something similar on Tuesday — want to see it?" This reduces noise without making the user feel surveilled.
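The repeat check can be sketched as a similarity lookup over the previous seven days. The version below uses simple word overlap (Jaccard similarity) as a stand-in for whatever semantic matching a production system would use, and the 0.6 cutoff is an assumed value, not a tuned one. All function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)  # look-back period described above
THRESHOLD = 0.6             # assumed similarity cutoff, not a tuned value
DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")


def tokens(text):
    """Lowercased word set; a crude stand-in for a semantic representation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def similarity(a, b):
    """Jaccard overlap between the word sets of two transcripts."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def find_repeat(new_text, new_time, history):
    """Return the most similar (text, time) pair from the last seven days, or None."""
    best, best_score = None, THRESHOLD
    for text, when in history:
        age = new_time - when
        if timedelta(0) <= age <= WINDOW:
            score = similarity(new_text, text)
            if score >= best_score:
                best, best_score = (text, when), score
    return best


def gentle_flag(match):
    """Phrase the repeat as an offer, not a correction."""
    _, when = match
    return f"You captured something similar on {DAYS[when.weekday()]}. Want to see it?"


history = [("I think I forgot to send the proposal to Marek", datetime(2024, 1, 2, 22, 40))]
match = find_repeat("I forgot to send the proposal to Marek", datetime(2024, 1, 5, 7, 5), history)
```

Word overlap misses paraphrases ("email Marek the deck" versus "send the proposal"), which is why a real deduplication layer would likely compare meaning rather than tokens.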
## Why sorting matters more than storage
That is why sorting matters more than storage. The product is not a notebook. It is a translator from brain-noise to next steps.
Storage is easy. Any voice memo app stores audio. The hard problem is the gap between the saved file and the moment you actually do something about what was inside it. People with ADHD often have folders of voice memos they have never replayed. The capture happened; the closing of the loop did not.
KeptMind's job is to close the loop automatically. The voice goes in, the structured task comes out, and the original audio is no longer required. The user did not record the voice in order to keep the audio. They recorded the voice in order to externalize the thought. Once the thought is externalized as a task, the audio is essentially scaffolding — useful during processing, unnecessary afterward.
This reframing changed how we built the product. We are not a voice memo app with task features. We are a task app with voice input. The audio is intermediate state, not stored content.
## The hard part: what counts as a task
We spent a long time on the AI layer that converts rambles into actions. The hard part is not transcription — modern transcription quality is excellent. The hard part is identifying what actually needs to happen next, and at what energy level, and whether it belongs in Today or Later.
Consider the sentence: "I should probably take a look at the budget thing before Marek's thing on Friday, but I don't know if that's realistic this week." A naive parser sees a task ("look at the budget"), a date ("Friday"), and a context ("Marek's thing"). A useful parser also notices the hedging language ("probably," "I don't know if that's realistic"), which signals low confidence on the user's part. A really useful parser asks one disambiguating question instead of guessing — "Should I add this for Wednesday so it's done before Friday, or save it as a maybe?" — and learns from the answer.
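To make the hedging idea concrete, here is a deliberately crude sketch that spots hedge phrases with a fixed list and turns them into a rough confidence score. The phrase list, the penalty of 0.3 per hedge, and the function names are all assumptions for illustration; the article's actual model is learned, not rule-based.

```python
import re

# Hypothetical hedge phrases; a real system would learn these rather than list them.
HEDGES = ("probably", "maybe", "i guess", "i don't know if", "not sure", "might")


def find_hedges(transcript):
    """Return the hedge phrases that appear in the transcript, in list order."""
    text = transcript.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return [h for h in HEDGES if re.search(r"\b" + re.escape(h) + r"\b", text)]


def user_confidence(transcript):
    """Crude score in [0, 1]: each distinct hedge subtracts a fixed penalty."""
    return max(0.0, 1.0 - 0.3 * len(find_hedges(transcript)))


sentence = ("I should probably take a look at the budget thing before Marek's thing "
            "on Friday, but I don't know if that's realistic this week.")
hedges = find_hedges(sentence)
confidence = user_confidence(sentence)
```

On the example sentence this finds "probably" and "i don't know if", which is the signal a parser could use to ask a question instead of committing to a date.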
Most of our model work has been on three subproblems: extracting the task from the surrounding context, estimating its energy cost so it sorts correctly into your day, and deciding when to ask a follow-up question versus when to commit to a guess. None of these are pure transcription problems, and none of them are obvious from a feature spec.
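The third subproblem, asking versus guessing, can be pictured as a threshold on the weaker of two signals: how sure the parser is about the task, and how committed the speaker sounded. This is a minimal sketch under that assumption; the `Extraction` fields and the 0.7 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    task: str
    model_confidence: float  # how sure the parser is that it found the right task
    user_confidence: float   # how committed the speaker sounded (hedging lowers this)


def should_ask(extraction, threshold=0.7):
    """Ask one follow-up question when either signal is weak; otherwise commit to the guess."""
    return min(extraction.model_confidence, extraction.user_confidence) < threshold


clear = Extraction("Bring the charger", model_confidence=0.95, user_confidence=1.0)
hedged = Extraction("Look at the budget", model_confidence=0.9, user_confidence=0.4)
```

Taking the minimum means a confidently parsed but heavily hedged capture still triggers a question, which matches the budget example above.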
One pattern we did not anticipate was the emotional preamble. About one in five captures begins with a sigh, a self-deprecating remark, or a verbal tic like "okay so" or "ugh, I keep forgetting." These preambles carry no task content, but they carry emotional signal. Early versions of our parser would sometimes interpret "I keep forgetting to call the dentist" as two items: a self-observation and a task. The current model strips the emotional scaffolding and extracts only the actionable core, but it logs the emotional tone as metadata that informs energy estimation. A capture that begins with frustration is more likely to represent a task the user has been avoiding, which means it may need a gentler nudge schedule or a lower energy-cost estimate so the user does not keep putting it off.
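The strip-and-log behavior can be sketched with a few patterns that peel preambles off the front of a transcript and record a tone label alongside the remaining core. The pattern list, the two tone labels, and `split_preamble` are hypothetical; they illustrate the idea, not the current model.

```python
import re

# Hypothetical preamble patterns paired with the tone they suggest.
PREAMBLES = (
    (r"(okay|ok) so\b[,\s]*", "neutral"),
    (r"ugh\b[,\s]*", "frustrated"),
    (r"i keep forgetting( to)?\b[,\s]*", "frustrated"),
)


def split_preamble(transcript):
    """Return (actionable_core, tone), removing any stacked preambles from the start."""
    core = transcript.strip()
    tone = "neutral"
    changed = True
    while changed:  # keep peeling until no pattern matches the start
        changed = False
        for pattern, label in PREAMBLES:
            m = re.match(pattern, core, flags=re.IGNORECASE)
            if m and m.end() > 0:
                core = core[m.end():]
                if label == "frustrated":
                    tone = label  # frustration is kept as metadata, not as a task
                changed = True
    return core, tone


core, tone = split_preamble("ugh, I keep forgetting to call the dentist")
```

Here the result is a single task ("call the dentist") plus a tone label, rather than the two items an earlier parser might have produced.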
## The twelve-second constraint
The twelve-second constraint is not arbitrary. It is the window where working memory still holds the context that made the thought feel urgent. Longer clips lose the why even when they save the what.
We tested capture lengths between five seconds and two minutes. Captures under twelve seconds were three times more likely to be acted on within twenty-four hours. Captures over thirty seconds were twice as likely to be deleted without ever being reviewed. The longer the recording, the less likely the user was to remember why they recorded it. The brain that produced the urgency had already moved on.
This finding shaped the UI. We do not encourage long captures. The capture button shows a soft visual cue at twelve seconds — not a hard cutoff, just a gentle "you have what you need now." Users can keep going if the thought is genuinely complex, but the default behavior nudges toward the productive zone.
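In code terms, the difference between a soft cue and a hard cutoff is small but deliberate. The sketch below assumes a UI layer that polls elapsed recording time; `capture_ui_state` and its return keys are hypothetical names.

```python
SOFT_CUE_AT_S = 12.0  # the twelve-second mark; advisory only


def capture_ui_state(elapsed_s):
    """Describe what the capture button should show at a given elapsed time."""
    if elapsed_s < 0:
        raise ValueError("elapsed time cannot be negative")
    return {
        "recording": True,  # recording is never stopped by the timer: there is no hard cap
        "show_soft_cue": elapsed_s >= SOFT_CUE_AT_S,
    }


early = capture_ui_state(5.0)
late = capture_ui_state(90.0)
```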
## What people capture, by category
About half of all captures land in the personal category — appointments, errands, health-related items, family logistics. About 35% are work — tasks, follow-ups, half-finished decisions. The rest is a long tail: ideas, journal-like reflections, things that turned out not to be tasks at all.
The personal/work ratio matters because most productivity tools are designed for work and tolerate personal use. KeptMind is designed for the boundary between the two — for the parent who has thirty seconds between meetings to remember to pick up the prescription. The use case is not "manage my projects." It is "do not lose this one."
## Frequently asked questions
### How long is a typical capture?
The median is 11.4 seconds. People speak in half-plans, not polished tasks, and the system is designed for this.
### What if I need to record something longer?
You can — there is no hard cap. We just nudge against it because longer captures historically get deleted without action. If you regularly need long-form capture, the typed brain dump flow is better suited.
### How accurate is the parsing?
On clean speech, task extraction is correct on the first try about 88% of the time, and asks one disambiguating question another 8% of the time. The remaining 4% are genuine misreads, which is why the parsed task always shows for review before it goes into Today.
### Does it work in noisy environments?
Reasonably well — modern speech models are robust to background noise up to a point. Cars, cafes, and walking on a street are usually fine. Crowded bars and gyms with loud music are harder. If transcription quality drops, we tell you and let you re-record.
### What languages are supported?
English at full quality at launch, with regional accent support across UK, US, Irish, Australian, South Asian, and African English. We chose to ship one language deeply rather than many shallowly.
## Related reading
If this article was useful, these related guides cover adjacent ground and are worth reading next:
- [ADHD Productivity Apps 2026](/blog/adhd-productivity-apps-2026)
- [Voice To Task ADHD Guide](/blog/voice-to-task-adhd-guide)
- [Executive Dysfunction ADHD Guide](/blog/executive-dysfunction-adhd-guide)
Each of the linked articles approaches the topic from a slightly different angle, and reading two or three of them together usually produces a more complete picture than any single article can. The shared underlying neurology means that improvements in one area often unlock progress in others, which is why the topics interconnect even when they appear separate at first glance.
