Video captioning models are typically trained on datasets consisting of short videos, usually under three minutes in length, paired with corresponding captions. While this enables them to describe basic actions like walking or talking, these models struggle with the complexity of long-form videos, such as vlogs, sports events, and movies that can last over an hour. When applied to such videos, they often generate fragmented descriptions focused on isolated actions rather than capturing the broader storyline. Efforts like MA-LMM and LaViLa have extended video captioning to 10-minute clips using LLMs, but hour-long videos remain a challenge due to a lack of suitable datasets. Although Ego4D introduced a large dataset of hour-long videos, its first-person perspective limits its broader applicability. Video ReCap addressed this gap by training on hour-long videos with multi-granularity annotations, yet this approach is expensive and prone to annotation inconsistencies. In contrast, annotated short-form video datasets are widely available and easier to work with.
Advances in vision-language models have significantly improved the integration of vision and language tasks, with early works such as CLIP and ALIGN laying the foundation. Subsequent models, such as LLaVA and MiniGPT-4, extended these capabilities to images, while others adapted them for video understanding by focusing on temporal sequence modeling and constructing more robust datasets. Despite these advances, the scarcity of large, annotated long-form video datasets remains a significant obstacle to progress. Traditional short-form video tasks, such as video question answering, captioning, and grounding, primarily require spatial or temporal understanding, whereas summarizing hour-long videos demands identifying key frames amid substantial redundancy. While some models, such as LongVA and LLaVA-Video, can perform VQA on long videos, they struggle with summarization tasks due to data limitations.
Researchers from Queen Mary University of London and Spotify introduce ViSMaP, an unsupervised method for summarising hour-long videos without requiring costly annotations. Traditional models perform well on short, pre-segmented videos but struggle with longer content where important events are scattered. ViSMaP bridges this gap by using LLMs and a meta-prompting strategy to iteratively generate and refine pseudo-summaries from clip descriptions created by short-form video models. The approach involves three LLMs working in sequence for generation, evaluation, and prompt optimisation. ViSMaP achieves performance comparable to fully supervised models across multiple datasets while maintaining domain adaptability and eliminating the need for extensive manual labelling.
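To make the three-role loop more concrete, here is a minimal, hypothetical sketch in Python. The `call_llm` placeholder, the prompt wording, the 0–10 scoring scheme, and the number of rounds are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a generator / evaluator / optimiser meta-prompting loop.
# `call_llm` stands in for any chat-completion API you plug in.

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM API here")

def meta_prompt_summarise(clip_captions: list[str], n_rounds: int = 3) -> str:
    generator_prompt = "Summarise these clip-level captions into one coherent video summary."
    best_summary, best_score = "", float("-inf")
    captions_text = "\n".join(clip_captions)
    for _ in range(n_rounds):
        # 1) Generator: produce a candidate pseudo-summary from the clip captions.
        summary = call_llm(generator_prompt, captions_text)
        # 2) Evaluator: score the candidate (e.g. for coverage and coherence).
        score = float(call_llm(
            "Rate this summary of the captions from 0 to 10. Reply with a number only.",
            f"Captions:\n{captions_text}\n\nSummary:\n{summary}"))
        if score > best_score:
            best_summary, best_score = summary, score
        # 3) Optimiser: rewrite the generator prompt based on the evaluator's feedback.
        generator_prompt = call_llm(
            "Improve this summarisation prompt so the next summary scores higher.",
            f"Prompt:\n{generator_prompt}\n\nSummary:\n{summary}\n\nScore: {score}")
    return best_summary
```

The key idea is that the prompt itself, not the model weights, is what gets refined between rounds, so the loop needs no gradient updates or labelled summaries.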
The study addresses cross-domain video summarization by training on a labelled short-form video dataset and adapting to unlabelled, hour-long videos from a different domain. First, a model is trained to summarize 3-minute videos using TimeSFormer features, a visual-language alignment module, and a text decoder, optimized with cross-entropy and contrastive losses. To handle longer videos, they are segmented into 3-minute clips, and pseudo-captions are generated for each clip. An iterative meta-prompting approach with multiple LLMs (generator, evaluator, optimizer) then refines these into pseudo-summaries. Finally, the model is fine-tuned on the pseudo-summaries using a symmetric cross-entropy loss to handle noisy labels and improve adaptation.
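Symmetric cross-entropy (SCE) is a standard recipe for training with noisy labels: it adds a reverse cross-entropy term that penalizes overconfident fits to wrong labels. Below is a minimal PyTorch sketch of a generic class-level SCE; the weights `alpha` and `beta`, the clamping constant, and the per-token application are assumptions, and the paper's exact formulation over caption tokens may differ.

```python
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, targets, alpha=1.0, beta=1.0, label_clamp=1e-4):
    """Generic SCE loss: alpha * CE(pred, label) + beta * reverse CE(label, pred)."""
    # Standard cross-entropy term.
    ce = F.cross_entropy(logits, targets)
    # Reverse term: treat the prediction as the "target" distribution and the
    # (clamped) one-hot label as the "prediction", which bounds the loss on noisy labels.
    pred = F.softmax(logits, dim=1).clamp(min=1e-7, max=1.0)
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).float()
    one_hot = one_hot.clamp(min=label_clamp, max=1.0)  # avoid log(0) on the label side
    rce = -(pred * torch.log(one_hot)).sum(dim=1).mean()
    return alpha * ce + beta * rce

# Tiny usage example with random logits over a 10-way vocabulary slice.
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
loss = symmetric_cross_entropy(logits, targets)
```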
The study evaluates ViSMaP across three scenarios: summarization of long videos using Ego4D-HCap, cross-domain generalization on the MSRVTT, MSVD, and YouCook2 datasets, and adaptation to short videos using EgoSchema. ViSMaP, trained on hour-long videos, is compared against supervised and zero-shot methods such as Video ReCap and LaViLa+GPT3.5, demonstrating competitive or superior performance without supervision. Evaluations use CIDEr, ROUGE-L, and METEOR scores, as well as QA accuracy. Ablation studies highlight the benefits of meta-prompting and of component modules such as contrastive learning and the SCE loss. Implementation details include the use of TimeSformer, DistilBERT, and GPT-2, with training carried out on an NVIDIA A100 GPU.
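For reference, caption metrics like CIDEr and ROUGE-L are commonly computed with the `pycocoevalcap` package. The sketch below uses made-up video IDs and texts and omits METEOR (whose wrapper requires a Java backend); it illustrates a typical evaluation setup, not necessarily the authors' exact scripts.

```python
# Minimal sketch of summary-metric evaluation with pycocoevalcap (assumed setup).
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Reference summaries (ground truth) and generated summaries, keyed by video ID.
gts = {
    "video_001": ["a chef prepares pasta and serves it to guests"],
    "video_002": ["a vlogger tours a city and visits several landmarks"],
}
res = {
    "video_001": ["someone cooks pasta in a kitchen and serves dinner"],
    "video_002": ["a person walks around a city and films famous sights"],
}

cider_score, _ = Cider().compute_score(gts, res)   # corpus-level CIDEr
rouge_score, _ = Rouge().compute_score(gts, res)   # mean ROUGE-L
print(f"CIDEr: {cider_score:.3f}  ROUGE-L: {rouge_score:.3f}")
```

Note that CIDEr is a corpus-level metric, so scores on only a couple of examples are not meaningful; real evaluations run over the full test split.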
In conclusion, ViSMaP is an unsupervised approach to summarizing long videos that uses annotated short-video datasets and a meta-prompting strategy. It first creates high-quality pseudo-summaries through meta-prompting and then trains a summarization model on them, reducing the need for extensive annotations. Experimental results show that ViSMaP performs on par with fully supervised methods and adapts effectively across various video datasets. However, its reliance on pseudo-labels from a source-domain model may affect performance under significant domain shifts. Additionally, ViSMaP currently relies solely on visual information. Future work could integrate multimodal data, introduce hierarchical summarization, and develop more generalizable meta-prompting strategies.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.