I’ve always hated the process of reporting during a pentest, but as of late I’ve come to realize that those were not my own thoughts.

Too many times while trying to get into cyber security, I read about how reporting is the worst part of a pentest. I adopted that ideology without ever thinking it through.

Over time I have come to enjoy writing. Putting my thoughts on paper and distilling the intricacies of my methodology into digestible, business-focused impact for the less technical folks is actually enjoyable. The same goes for keeping the technical content adaptable, readable, and digestible for the technical folks who may not have context on what was going on during the pentest or why it was done to begin with.

However, reporting is not easy, and it is very time-consuming, especially on a long engagement with lots of findings. All that typing has given me carpal tunnel syndrome in my wrist, which is an absolute pain.

That is one reason I use a split keyboard, which has made things way easier, but typing is still time-consuming. In an effort to adopt AI into my workflows (inevitably so), I have been using Wispr Flow for my personal work, but I can't do that in my day-to-day job given the privacy concerns of pentest environments, where clients do not want their data touched by AI. Commercial AI, to be exact: the OpenAIs and Claudes of this world.

As with anything in life, there are people on both sides of the fence: those who have embraced AI and gladly give all their data to BigAI, and those who will not touch AI with a ten-foot pole.

Anyways, back to reporting. Since discovering Wispr Flow and adopting it in my personal workflows, I have been wondering how I could do something similar offline with local models.

I think I have come up with a pretty janky v0.1 that has been working for my needs. Meet Taura, your local dictation assistant for all your reporting needs.

You can find it here.

My local setup is Ubuntu running i3. That is the only setup I've tested so far and the one the tool is currently adapted for, but you can adapt it to your needs as you see fit. I've also only used the "small" model (described in the repo README), which is a good balance between speed and accuracy.

The current frustration is that it does not handle technical words and file paths well, such as SHA1 or /etc/passwd, so I will test the more resource-intensive models to see which one works best for my needs.

Here is an example of the output from Taura, covering abbreviations, numbers, punctuation, file paths, and so on. This first passage is the control passage:

Control Passage

The transcription server loads the Whisper model into RAM at startup and keeps it resident across requests. When the user holds Control plus Shift plus Space, the listener begins recording via sox at sixteen kilohertz, mono channel, sixteen-bit signed integer PCM. Upon release, the resulting WAV file path is sent through a Unix domain socket at tilde slash dot dictation dot sock using socat.

The server decodes the audio, applies beam search with a beam size of five, and returns the top hypothesis as plain UTF-8 text. Finally, xdotool types the result into whatever window currently holds focus — whether that’s a terminal, a browser, or a code editor like VS Code.

Latency depends on utterance length and model size: “tiny” runs in under half a second on most CPUs, while “large-v3” may take three to five seconds without GPU acceleration via CUDA or cuDNN.

This second passage is from Taura:

Dictation Passage

The transcription server loads the whisper model into RAM at startup and keeps it resident across requests. When the user holds control plus shift plus space, the listener begins recording via socks at 16 kHz, mono-channel 16-bit signed integer PCM. Upon release, the resulting WAV file path is sent through a Unix domain socket at title slash dot dictation dot sock using Socat. The server decodes the audio, applies beam search with a beam size of 5, and returns the top hypotheses as plain UTF-8 text. Finally, the XDO tool types the result into whatever window currently holds focus, whether that’s a terminal, a browser, or a code editor like VS Code or Sublime, latency depends on utterance length and model size. Tiny runs in under half a second and most CPUs dwell. Large v3 may take 3-5 seconds without GPU acceleration via CUDA or CU-DNN.
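For the curious, the handoff the control passage describes, a WAV file path sent over a Unix domain socket and a transcript sent back, can be sketched with nothing but Python's standard library. To be clear, this is a stub and not Taura's actual code: the socket path and the fake "transcript" reply are placeholders, and the Whisper beam-search inference is replaced by a one-liner so only the socket plumbing is visible.

```python
import os
import socket
import tempfile
import threading

# Stand-in for ~/.dictation.sock; the path here is illustrative.
SOCK_PATH = os.path.join(tempfile.gettempdir(), "dictation-demo.sock")
if os.path.exists(SOCK_PATH):
    os.unlink(SOCK_PATH)

# Server side: the real server loads a Whisper model once at startup and
# runs beam search (beam size 5) on the decoded audio. Stubbed here.
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(SOCK_PATH)
srv.listen(1)

def handle_one_request():
    conn, _ = srv.accept()
    wav_path = conn.recv(4096).decode("utf-8")  # client sends the WAV path
    reply = f"transcript of {wav_path}"          # stub for model inference
    conn.sendall(reply.encode("utf-8"))
    conn.close()

t = threading.Thread(target=handle_one_request)
t.start()

# Client side: roughly what `socat - UNIX-CONNECT:~/.dictation.sock` does
# after the hotkey is released and the WAV file is written.
cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(SOCK_PATH)
cli.sendall(b"/tmp/utterance.wav")
cli.shutdown(socket.SHUT_WR)  # signal that the full path has been sent
text = cli.recv(4096).decode("utf-8")
cli.close()
t.join()
srv.close()
os.unlink(SOCK_PATH)

print(text)  # -> transcript of /tmp/utterance.wav
```

The one design point worth noting is why the path travels over the socket rather than the audio itself: the server keeps the model resident in RAM across requests, so each hotkey press only costs the socket round-trip plus inference, not a model reload.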

Cheers.

References: