ICASSP 2007 - April 15-20, 2007 - Honolulu, Hawai'i, U.S.A.

TUT-16: Automatic Spoken Document Processing for Retrieval and Browsing

Monday Afternoon, April 16
14:00 - 17:00
Room 325A

Presented by

Ciprian Chelba, Google, and T. J. Hazen, MIT Computer Science and Artificial Intelligence Lab

Abstract

Ever-increasing computing power and connectivity bandwidth, together with falling storage costs, result in an overwhelming amount of data of various types being produced, exchanged, and stored. Consequently, search emerges as a key application as more and more data is saved.

Speech search has not received much attention because large collections of untranscribed spoken material have not been available, mostly due to storage constraints. As storage becomes cheaper, the availability and usefulness of large collections of spoken documents are limited only by the lack of adequate technology to exploit them. Manually transcribing speech is expensive and sometimes outright impossible due to privacy concerns. This leads us to explore an automatic approach to searching and navigating spoken document collections.

This tutorial will present an overview of speech transcription, indexing, and search technologies for spoken documents, with an emphasis on a corpus containing recorded academic lectures. The tutorial will point out general problems in this area and suggest possible solutions. It will cover three topics: scenarios and previous projects in the area of spoken document retrieval; issues in the automatic transcription of long audio files (including vocabulary coverage, the out-of-vocabulary problem, and adaptation of speech recognition models); and techniques for the indexing and retrieval of spoken audio files (including a review of text retrieval techniques, automatic speech recognition lattice techniques, query processing, relevance scoring, and evaluation). Time permitting, a discussion of user interface issues for browsing audio files will also be included.

An outline of the tutorial is as follows:

  1. Introduction: Scenarios/Previous Work/Corpora (20 minutes)
    1. Scenarios:
      1. some economic considerations for viability of such technology (it costs $100/hr to transcribe speech)
      2. scenarios where it is not expected to be useful: video broadcasts (closed captioning is required by the FCC, and ad revenue might pay for transcription)
      3. scenarios where it is expected to be useful: lectures, podcasts, call center data mining, surveillance
    2. Broadcast news (present primarily as background material)
      1. Characteristics
      2. Meta-data annotation
      3. Past work (HP SpeechBot, BBN, TREC, PodZinger, etc.)
    3. Academic & Scientific Lectures (primary driver for rest of talk)
      1. Examples (OCW, CSJ, MICASE)
      2. Characteristics
      3. Challenges and opportunities
  2. Recognition (1 hour and 10 minutes)
    1. (Very) Brief overview of recognition models and processing
    2. Vocabulary Issues
      1. Examination of vocabulary statistics and coverage
      2. Vocabulary expansion from supplemental materials
    3. Language modeling issues
      1. Spontaneous conversational speech vs. read speech
      2. Appropriateness of written materials for LM
      3. Language model adaptation
    4. Acoustic modeling issues
      1. Speaker independent modeling
      2. Speaker dependent modeling
      3. Supervised and unsupervised acoustic model adaptation
    5. OOV modeling
      1. Methods for recognizing OOV words
      2. Phonetic transcription of OOV words
  3. Audio Retrieval (1 hour and 30 minutes)
    1. Brief overview of text retrieval algorithms:
      1. TF-IDF/vector space methods
      2. probabilistic methods
      3. large scale web search (Google)
      4. inverted indexing; query processing/language.
    2. ASR lattices:
      1. word/phone/OOV-models for generation
      2. lattice accuracy vs. 1-best accuracy
    3. Query processing (OOV problem)/language:
      1. "soft"-indexing (need to index probabilities); index pruning to control size;
      2. need to combine both sub-word and word-level indexing/recognition results
    4. Relevance scoring:
      1. proximity
      2. incorporating various data streams: speech, text, title, author, abstract, etc.
      3. tuning precision/recall at query run-time
    5. Evaluation:
      1. set metrics: Precision/Recall
      2. ordered list metrics: Kendall-Tau, Spearman
      3. TREC measures and package (Mean Average Precision, R-precision)
      4. for speech, the retrieval output obtained on the transcription can be used as the reference
    6. User interface:
      1. issues in consuming speech (as opposed to text, images); limited bandwidth channel
      2. Pros/cons of displaying the transcription
      3. Navigation in long documents; hits may be errorful
      4. Segmentation: topic boundaries, keywords, summaries
  4. Summary and Conclusions

Ciprian Chelba will present 1.1-1.2 and 3.1-3.5.
T. J. Hazen will present 1.3, 2.1-2.5, 3.6, and 4.
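
For readers unfamiliar with the evaluation measures named in the outline (set-based Precision/Recall and Mean Average Precision), the following is a minimal Python sketch of how they are commonly computed. It is an illustration only and is not taken from the tutorial material. The function names and the document identifiers (`lec03`, etc.) are hypothetical, and the example assumes the reference set comes from searching the transcription, as the outline suggests.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of the precision values measured at the rank of each relevant hit.

    Assumes `ranked` contains no duplicate document ids.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """`runs` is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical example: `reference` holds the documents returned when the
# query is run on the transcription; `ranked` is the ranked list returned
# when the same query is run on the speech recognizer output.
reference = {"lec03", "lec07", "lec12"}
ranked = ["lec07", "lec01", "lec03", "lec09"]

p, r = precision_recall(ranked, reference)
ap = average_precision(ranked, reference)
```

Precision and recall ignore the order of the returned list, whereas average precision rewards placing relevant documents near the top, which is why both kinds of measure appear in the outline.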

Target Audience and Prerequisites

The material is presented as a basic tutorial that tries to balance a broad overview of the area with actual results from the authors' own work on the MIT iCampus lecture corpus.

Basic knowledge of probability theory and statistical modeling for speech and language should be sufficient to follow the tutorial.

Speaker Biographies

Ciprian Chelba is a Research Scientist with Google. Previously he worked as a Researcher in the Speech Technology Group at Microsoft Research.

His core research interests are in statistical modeling of natural language and speech. Recent projects include speech content indexing for search in spoken documents, discriminative language modeling for large vocabulary speech recognition, as well as speech and text classification.

Timothy J. Hazen is a Research Scientist at the MIT Computer Science and Artificial Intelligence Laboratory, where he works in the areas of automatic speech recognition, automatic person identification, multi-modal speech processing, and conversational speech systems. For the last two years he has been a key contributor to the MIT Spoken Lecture Processing Project.


©2012 Conference Management Services, Inc. -||- email: webmaster@icassp2007.com -||- Last updated Wednesday, April 04, 2007