azure-ai
============================

Azure AI playground and notes

# Speech to text

## Recording
Recording directly from the mic seems to work out of the box with very little debugging.

![Direct recording](note-assets/speec-to-text-recording.png)

## Uploading yt-dlp extracted audio
Tried to upload audio from a video downloaded with yt-dlp for speech to text, but ran into problems with the audio format. Here are the steps I took:

1. Download the video with `yt-dlp [url]`
2. Extract audio with ffmpeg `ffmpeg -i [file].webm -q:a 0 -map a [file].mp3`
3. Uploading the mp3 failed
4. Convert mp3 to wav via ffmpeg `ffmpeg -i quartering.webm.mp3 -acodec pcm_u8 -ar 22050 quartering.webm.wav`
5. No dice
6. Then tried exporting the audio via Audacity
7. Still the same error

![Error message](note-assets/speech-to-text-error-file-format.png)

### Seems to be very picky with file formatting

Based on this article https://www.unimelb.edu.au/accessibility/automatic-speech-recognition/getting-started-with-microsoft-azure-speech-to-text it seems that audio needs to be in a very specific format.

"The out of the box speech-to-text Service is available for quick real-time Speech-to-text service and transcription of WAV audio file(s) (16kHz or 8kHz, 16-bit, and mono PCM)."

By the way, the official documentation is remarkably mum about this requirement.

Anyway, let's try converting again.

```bash
ffmpeg -i q.mp3 -acodec pcm_s16le -ac 1 -ar 16000 q3.wav 
```
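
The mp3 step in between is not strictly needed, since ffmpeg can go from the downloaded file straight to the target format. A minimal sketch, assuming a hypothetical helper name `to_azure_wav` (not part of any tool) and the same flags as above, with `-vn` added to drop the video stream:

```bash
# Hypothetical helper: convert any ffmpeg-readable file to 16 kHz, 16-bit, mono PCM WAV.
# Usage: to_azure_wav INPUT [OUTPUT]
to_azure_wav() {
  local input=$1
  # Default output name: same path with the last extension swapped for .wav
  local output=${2:-"${input%.*}.wav"}
  # -vn = ignore video, pcm_s16le = 16-bit PCM, -ac 1 = mono, -ar 16000 = 16 kHz
  ffmpeg -i "$input" -vn -acodec pcm_s16le -ac 1 -ar 16000 "$output"
}

# Example (not run here): to_azure_wav "[file].webm"
```

Note that if the input already ends in `.wav`, the default output name collides with the input, so pass an explicit second argument in that case.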

![Working speech to text](note-assets/speech-to-text-working-wav.png)
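
To avoid finding out about a bad format only after uploading, the WAV header can be checked locally against the requirement quoted above (16 kHz or 8 kHz, 16-bit, mono PCM). The following is a rough sketch, not an official tool: `read_le` and `check_wav` are made-up names, and it assumes the common layout where the `fmt ` chunk sits right after the RIFF header (so the fields live at byte offsets 20 to 35). Files with extra chunks before `fmt ` would be misread.

```bash
# Read an N-byte little-endian unsigned integer from FILE at OFFSET.
# Usage: read_le FILE OFFSET NBYTES
read_le() {
  local value=0 shift_bits=0 byte
  for byte in $(od -An -t u1 -j "$2" -N "$3" "$1"); do
    value=$(( value + (byte << shift_bits) ))
    shift_bits=$(( shift_bits + 8 ))
  done
  echo "$value"
}

# Print the header fields and succeed only for 16-bit mono PCM at 16 kHz or 8 kHz.
check_wav() {
  local f=$1 fmt channels rate bits
  [ "$(head -c 4 "$f")" = "RIFF" ] || { echo "not a RIFF file"; return 1; }
  fmt=$(read_le "$f" 20 2)       # 1 means uncompressed PCM
  channels=$(read_le "$f" 22 2)  # 1 means mono
  rate=$(read_le "$f" 24 4)      # sample rate in Hz
  bits=$(read_le "$f" 34 2)      # bits per sample
  echo "format=$fmt channels=$channels rate=$rate bits=$bits"
  [ "$fmt" -eq 1 ] && [ "$channels" -eq 1 ] && [ "$bits" -eq 16 ] &&
    { [ "$rate" -eq 16000 ] || [ "$rate" -eq 8000 ]; }
}

# Example (not run here): check_wav q3.wav && echo "looks uploadable"
```

By this check, the earlier `pcm_u8` at 22050 Hz attempt would fail on both the bit depth and the sample rate.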

## Language switching

Let's switch to Finnish and try this again.

As source data we use a video from the article https://yle.fi/a/74-20080518. The audio
is recorded with Audacity and then exported as WAV.

![Audacity recording already records in the proper format and exports correctly](note-assets/speech-to-text-finnish-audacity.png)

Finnish is a notoriously difficult language to learn (or so I've heard), and my experiences with various translation solutions have left a lot to be desired. Here's the result for the small news clip.

![Amazing results as far as accuracy is concerned](note-assets/speech-to-text-finnish-results.png)

I would say these results are amazing as far as accuracy is concerned, compared to other solutions from even five-ish years ago. Granted, I haven't needed to do anything like this before, so maybe I am hyping over nothing, but still, pretty good.


# Computer vision

So let's see if computer vision can read a receipt.

![Not bad](note-assets/vision-api-ocr-receipt.png)