YouTube Transcriptions
Hugging Face
The YouTube transcriptions dataset contains technical tutorials (currently from James Briggs, Daniel Bourke, and AI Coffee Break) transcribed using OpenAI's Whisper (large). Each row holds a roughly sentence-length chunk of text alongside the video URL and timestamp. Because each chunk is so short, most use cases will likely need to merge multiple rows into more substantial chunks of text. If you need to do that, this code snippet will… See the full description on the dataset page: https://huggingface.co/datasets/jamescalam/youtube-transcriptions.
Columns
- title categorical
- published categorical
- url categorical
- video_id categorical
- channel_id categorical
- id text
- text text
- start numeric
- end numeric
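The description says most use cases need to merge several sentence-length rows into larger chunks. The following is a minimal sketch of one way to do that with a sliding window. It is not the snippet from the dataset page. It assumes rows are dicts keyed by the column names listed above; `merge_rows`, `window`, `stride`, and the sample rows are hypothetical and only illustrate the idea.

```python
from itertools import groupby
from operator import itemgetter


def merge_rows(rows, window=4, stride=2):
    """Merge consecutive transcript rows of the same video into larger chunks.

    `window` is how many rows go into one chunk; `stride` is how far the
    window advances each step (stride < window gives overlapping chunks).
    Both defaults are illustrative assumptions, not values from the dataset card.
    """
    merged = []
    # Sort by video and start time so each video's rows are contiguous and ordered.
    ordered = sorted(rows, key=itemgetter("video_id", "start"))
    for video_id, group in groupby(ordered, key=itemgetter("video_id")):
        group = list(group)
        for i in range(0, len(group), stride):
            batch = group[i : i + window]
            merged.append(
                {
                    "video_id": video_id,
                    "title": batch[0]["title"],
                    "url": batch[0]["url"],
                    "text": " ".join(r["text"].strip() for r in batch),
                    "start": batch[0]["start"],  # start of the first row in the chunk
                    "end": batch[-1]["end"],  # end of the last row in the chunk
                }
            )
    return merged


# Hypothetical sample rows standing in for real dataset rows.
rows = [
    {
        "video_id": "vid-1",
        "title": "Example video",
        "url": "https://example.com/vid-1",
        "text": f"sentence {i}",
        "start": float(2 * i),
        "end": float(2 * i + 2),
    }
    for i in range(5)
]

chunks = merge_rows(rows, window=4, stride=2)
for chunk in chunks:
    print(chunk["start"], chunk["end"], chunk["text"])
```

To run this on the real data, the rows could be loaded with the Hugging Face `datasets` library, for example `load_dataset("jamescalam/youtube-transcriptions", split="train")`, and passed to the function as an iterable of dicts. Keeping `start` and `end` on each merged chunk preserves the link back to the right position in the video.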