YouTube Video Analyzer: Multimodal

This skill performs deep analysis of YouTube videos through both information channels:

- Audio channel: Transcript with timestamps (what is SAID)
- Visual channel: Frame extraction + image analysis (what is SHOWN)

Most YouTube skills only extract transcripts. This skill closes the gap by synchronizing visual frames with spoken content, enabling accurate step-by-step guides where "click the blue button" is matched with the actual screenshot showing which button.

Workflow Overview

Step 1: Setup Working Directory

Step 2: Get Video Metadata

This returns thr…