Abstract: A brief yet comprehensive summary of a lengthy video helps us grasp its key insights. Video summarization aims to generate a 'video thumbnail' from a given input video. Although the field has been widely studied, to the best of our knowledge all existing works have focused on the visual modality alone, even though the audio component may carry crucial information for effective video summarization. To this end, we introduce AudViSum, a novel self-supervised audio-visual summarization network that leverages both audio and visual information and employs deep reinforcement learning to reward the model for generating diverse yet semantically meaningful summaries. Our experiments establish that combining audio-visual information helps generate realistic summaries from relatively lengthy input videos. To ensure diverse summary generation, we report the top-3 summaries for each video. Since no publicly available annotations exist for evaluating audio-visual summaries, we annotate the TVSum and OVP datasets, comprising 50 videos each. Experimental results indicate that AudViSum achieves promising performance on the audio-visual summary generation task when compared against human annotations.