
r/difyai






Dataset using YT/Podcast Transcripts

Hi everyone,

I am new to RAG systems and have a problem. I am building a Q&A RAG system, and my dataset is mostly YouTube podcast transcripts. Despite adding more data and a more advanced pipeline, the system cannot retrieve specific information (e.g., analyses of specific companies or products mentioned in the podcasts). Mostly it says there is nothing about the topic in the context, or it gives very shallow answers.

My current stack:

Workflow: Dify.

Data Prep: Raw YouTube transcripts. I used GPT-4o-mini to generate summaries and extract metadata tags for each file, and I add that metadata to Dify.

Chunking: 1500 chunk size with 250 overlap.

Embedding: OpenAI text-embedding-3-large.

Retrieval Strategy: 2-pass retrieval. One search directly with the user's prompt, and another search where an LLM transforms/expands the prompt. I combine the results.

Generator LLM: DeepSeek R1.
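
For anyone who wants to reason about the chunking and two-pass retrieval steps outside of Dify, here is a minimal sketch. It is not the Dify implementation. It assumes the chunk size and overlap are counted in characters (they may well be tokens), and it uses reciprocal rank fusion as one possible way to combine the two result lists, since the exact merge method is not specified above. The names `chunk_text` and `merge_results`, and the constant `k=60`, are hypothetical choices for illustration.

```python
from typing import Dict, List


def chunk_text(text: str, size: int = 1500, overlap: int = 250) -> List[str]:
    """Split text into windows of `size` characters.

    Each window starts `size - overlap` characters after the previous one,
    so consecutive chunks share `overlap` characters.
    Assumption: sizes are in characters, not tokens.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once a window has reached the end of the text.
        if start + size >= len(text):
            break
    return chunks


def merge_results(direct: List[str], expanded: List[str], k: int = 60) -> List[str]:
    """Combine two ranked lists of chunk IDs with reciprocal rank fusion.

    `direct` is the ranking for the raw user prompt, `expanded` the ranking
    for the LLM-rewritten prompt. A chunk found by both passes accumulates
    score from each, so it rises above chunks found by only one pass.
    """
    scores: Dict[str, float] = {}
    for ranking in (direct, expanded):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


if __name__ == "__main__":
    transcript = "x" * 4000  # stand-in for a real transcript
    print(len(chunk_text(transcript)))
    print(merge_results(["a", "b", "c"], ["b", "d"]))
```

Running the merge on a few real queries and printing the fused IDs is a quick way to check whether the chunk that mentions a specific company is retrieved at all, or retrieved but ranked too low to reach the generator.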

Has anyone tackled retrieval from conversational/podcast data? Are there any recommendations? Thanks!
