Cool. Can you share more about the tech? Is the ingestion engine a background task that scrapes the web? Do you download the mp4, convert it to a transcript, then generate embeddings? For each embedding of an episode, how do you break it down? Do you summarize the episode and embed it with metadata? Are you using Pydantic AI for structured output? Celery for tasks? Just a curious dev.
The ingestion engine is indeed a cron job that runs once a day to pick up the latest podcast episodes. Yes, it scrapes the web for episodes and then populates the database. And yup, I transcribe the audio to text, then run the text through embedding models to get the embeddings. The secret sauce is using language models to find promising snippets within each episode by running a sliding window over the transcript. So I actually make different types of embeddings: one for highlights and one for episodes. I also use the metadata in podcast episodes to enhance recommendations, mainly by deriving the strength of the source that produced the content.
You are spot on, I use Celery for tasks, many different kinds of tasks actually. It's a super handy tool to have, and it really extends what I am able to do on Heroku. My devops life becomes much more comfy.
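For anyone curious what the sliding-window step could look like, here is a minimal sketch and not the actual pipeline. The window size, stride, and function names are all assumptions made up for illustration, and the `score` callable is a hypothetical stand-in for whatever language model call rates each window.

```python
from typing import Callable, List, Tuple


def sliding_windows(words: List[str], size: int, stride: int) -> List[Tuple[int, str]]:
    """Return (start_index, text) for each overlapping window of up to `size` words."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append((start, " ".join(words[start:start + size])))
        # Stop once this window has reached the end of the transcript.
        if start + size >= len(words):
            break
        start += stride
    return windows


def top_snippets(
    transcript: str,
    score: Callable[[str], float],  # hypothetical: a language model rating in practice
    size: int = 200,                # assumed window length in words
    stride: int = 100,              # assumed step between windows
    k: int = 3,
) -> List[Tuple[int, str]]:
    """Score every window and return the k highest-scoring (start_index, text) pairs."""
    words = transcript.split()
    if not words:
        return []
    scored = [(score(text), start, text) for start, text in sliding_windows(words, size, stride)]
    # Sort by score only; Python's sort is stable, so ties keep transcript order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(start, text) for _, start, text in scored[:k]]


def keyword_score(text: str) -> float:
    """Toy scorer used only so the sketch runs without a model."""
    return float(text.lower().split().count("key"))
```

Calling `top_snippets(transcript, keyword_score)` returns the best windows with their starting word offsets. Each returned snippet could then be embedded separately from the episode as a whole, which would match the two embedding types mentioned above.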