TAPE workshop on the curation and preservation of audiovisual collections
University of Glasgow, Scotland, UK
Monday 12th – Friday 16th May 2008

Metadata and Documentation
Giorgio Dimino
RAI Research Centre
[email protected]
Centro Ricerche e Innovazione Tecnologica

The two main objectives of archive management
• Preservation: keep assets alive
• Access: make content available to users and customers

Is digital preservation sufficient to improve access?
Generally NO!
Access performance is driven by two main factors:
1. the time needed to retrieve and select content
2. the time needed to deliver the selected content to the user in the requested format
Factor 1 is often the most critical.

The larger the collection, the more difficult selection becomes
• Need for descriptive metadata and a proper retrieval system

• Identify user needs
• Analyse the "business" and define use cases
• Define access granularity
• Identify the basic entities of the model (objects)
• Define preferred search criteria
• Consider the possible need for other access methods
  • Thematic dossiers
  • Showcases
• Identify interoperability issues with other systems

Who are the users?
• Professionals
  • Often looking for specific things
  • Can handle complex data models
  • Prefer precision to simplicity
• General public
  • Not necessarily technology fans
  • Need a simple interface and search tools (e.g., Google)
  • May need help (proactive systems)

Access granularity
• Access granularity depends on the content genre and foreseen usage
• E.g., RAI strategy
  • Main TV programmes: programme item
  • Fiction/Movies: programme
  • News: news story
  • Sport:
  • Radio: programme item
  • Music: track
• Proper documentation models must be designed and enforced to support the required access granularity
• This may require redocumenting all the archive content

Browsing
• Often selection cannot be accomplished without viewing the results of retrieval (a survey of the RAI archive shows that about 80% of tape handling was due to viewing requests)
• Viewing high-res digitised content or master analogue media is very expensive and time consuming
• Multimedia documentation, based on key frames and low-res (e.g., MPEG4) copies, provides cheap and fast selection of footage, in the context of retrieval and on the user's desktop
• An order can be automatically issued to the central archive for the download of the selected footage to the user

Archive duality
• Documentation maps to editorial entities
  • Programs, collections, items, …
• Editorial entities must be mapped to essence copies
  • E.g., a copy of program "abcd" is contained on tape 1234 from TC 00:01:00:00 to TC 00:10:00:00
• Essence maps to physical media or files
• Several essence versions can co-exist at the same time
  • E.g., original Beta SP tape, digital master, low-res MPEG4, etc.
  • If they are time aligned, any one can be used as a proxy for the others
• Documentation to essence is a one-to-many relationship
  • The same documentation applies to several essence versions

Linking elements
• Time references
  • Real world time
    representation
    • time unit count since a reference date
    • Gregorian date and time of day
  • Media stream time
    • Offset and duration in frame/sample units
• Media locators
  • URL
  • physical positions
• Object references
  • object unique identifiers
    • UMID
    • UPID

Metadata management criticalities
• Documentation models
  • Driven by internal requirements
  • No single standard
• Documentation costs and quality
  • Manual annotation is expensive and time consuming
  • Subjectivity must be avoided
  • Automatic content analysis can be helpful but is still experimental
• Data models
  • They are the implementation of a documentation model
  • They must be designed so that the retrieval requirements can be implemented

Documentation strategies*
[Diagram labels: Collection; Programme 1, Programme 2, Programme 3; Item 1; Shot 1, Shot 2, Shot 3; segment1]

• Hierarchy of documentation entities
• Time relations between entities must be exploited in retrieval
• Each entity has a set of attributes attached
• Rigid structure, extensibility limited to attributes
• Implementation can be optimised for retrieval

• Stratification of timed documentation attributes
• Single documentation entity that represents the program
• Very flexible and extensible
• Difficult to exploit in retrieval

*From EBU P/FTA Future Television Archives report

Data model requirements
• Interoperability with existing standards
  • EBU P/META, ISO MPEG7, Dublin Core, SMPTE MXF and DMS1
• Clean separation between editorial and material related information
• Definition of the basic entities and relations
• Incremental definition of specialized entities and attributes according to the needs

Comparison between users' data model entities (from PrestoSpace Deliverable D15.1)
ENo.
RAI-DM | INA-DM | ORF-FARAO | DR-DM | BBC-SMEF | MXF-DMS1 | P_META | DC
[Table residue, rows 1 to 6. Entity names appearing in the cells: Collection; Brand; Programme-Group; Programme; Program; Main/Single Programme; Production; Production framework; Programme Item; Program Item ("Contribution"); Item; Scene framework; MOB (Media Object); MIN (Media Object Instance); MOI (Media Object Instance)]

Dublin Core vs. P/META
Dublin Core element: P_META (EBU) attributes
• Title: A59, A61, A99, A107, A110, A114, A146, A198, A401
• Creator: A81, A82, A83, A87, A88, A89, A90, A125, A254, A255, A256, A413, A414
• Subject and Keywords: difficult to define, Coverage could be used here
• Description: /
• Publisher: A81, A82, A83, A87, A88, A89, A90, A125, A254, A255, A256, A413, A414
• Contributor: A11, A81, A82, A83, A87, A88, A89, A90, A125, A254, A255, A256, A413, A414
• Date: A152, A217, A218, A219, A22, A367, A368, A405
• Resource Type: A12, A13, A226
• Format: A72, A73, A222, A361
• Resource Identifier: A105
• Source: A223, A224
• Language: A21, A22, A65, A66, A141, A407, A415, A416, A417, A418
• Relation Resources: /
• Coverage: A1, A9, A21, A22, A38, A67, A123, A141, A207, A214
• Rights Management: A14, A15, A18, A19, A20, A116, A117, A118, A119, A120, A121, A122, A162, A200, A201, A202, A203, A204, A205, A206, A212, A421, A422
Dc:title --> Program-Title, Program-Sub-Title, Program-Working-Title, Program-Episode-Title, Item-Title, Item-Sub-Title, PGR-Title, PGR-Sub-Title, PGR-Working-Title

European Digital Library project
• eContentplus project that addresses the integration of the bibliographic catalogues and digital collections of most of the European national libraries
• Main target is libraries, but museums, archives and AV collections are also included
• Making metadata available through the OAI (Open Archives Initiative) guidelines is encouraged

Open Archives Initiative
• The Open Archives Initiative develops and promotes interoperability
  standards that aim to facilitate the efficient dissemination of content
• The Protocol for Metadata Harvesting (OAI-PMH) specifies methods for accessing heterogeneous collections and requires Dublin Core compatibility as the minimum metadata dissemination level

EDOB – the PrestoSpace data model
• Programme descriptive metadata
  • Based on the P/META schema
• Timed metadata
  • Based on MPEG7 Temporal Decomposition
• Programme – Material associations
  • Custom structure

EDOB structure

EDOB subclasses

Data model and data formats (xml)
• Root element / Wrapper
• Identification information
  • titles
  • identifiers
  • contributions
  • publications
  • other
[Diagram labels: E1:EditorialObject; Publication Event (datetime, etc.); S1:PublicationService, S2:PublicationService; Contribution (role type); P1:Person, P2:Person, P3:Person; O1:Organisation]

Data model and data formats
• Root element / Wrapper
• Identification information
• Material realisations
[Diagram labels: E1:EditorialObject; M1:Material, S1:storage/file; M2:Material, S2:storage/file]

Data model and data formats
• Root element / Wrapper
• Identification information
• Material realisations
• Editorial partitions and views
  • Editorial parts
  • Shots & other video segmentation
  • Speech transcription & other audio segmentation
[Diagram labels: E1:EditorialObject; t1:transcription (repeated); v1:shot, k1:keyframe (repeated); timeline; E2:EditorialPart, E3:EditorialPart, E4:EditorialPart, E5:EditorialPart; R1:RelatedSource, R2:RelatedSource; Topic; Named Entities: T1:Time, L1:Location, O1:Organisation, P1:Person]

Data model and data formats
[Diagram labels: E1:EditorialObject]
• Root
  element / Wrapper
• Identification information
• Material realisations
• Editorial partitions and views
• Content related information
• Enrichment information
[Diagram labels: timeline; E2:EditorialPart, E3:EditorialPart, E4:EditorialPart, E5:EditorialPart; R1:RelatedSource, R2:RelatedSource; Topic; Named Entities: T1:Time, L1:Location, O1:Organisation, P1:Person]

Schema of PrestoSpace document format (XML)
• Root element / wrapper
• Identification and Language information
• Material realisations
• Editorial partitions and views
• Content related information
• Enrichment information
• Ancillary Data
[Legend: Ad hoc structures; P_META sets; MPEG7 profile nodes]

XML schema composition
[Diagram labels: imports PMETA 2.0 XML Schema; imports MAD XML Schema; imports MPEG7 DAVP profile XML Schema; Core Platform Definitions XML Schema]

Automatic content analysis
• Which features?
  • Video
    • Colour
    • Shape
    • Texture
    • Motion
  • Audio
    • sound effects
    • instrument description
    • speech recognition
• Why?
  • Segmentation
    • temporal
    • spatial
  • Documentation
    • automatic documentation
    • aid to the documentalist
  • Query by example

The Preservation Factory
• Migration units for preservation services
  • Fast, efficient, affordable
  • Using automation/process optimisation
  • Centralised, delocalised, and/or mobile units…
• Role of PrestoSpace: ensure these services are taken up
  • Key technology development, communication, labelling, encouraging/assisting investors & users...
Work Breakdown
[Diagram labels: Analogue documents (film, video, audio); Preservation; Playback Devices; Robotics and Automation; Media Condition Assessment; Restoration; Storage System; Tools; Visual & Audio Algorithms & Subsystems Integration & Evaluation; Mass storage cost management; life cycle management; Delivery & Access; Turnkey System Export System Integration; Preserved & digitized collections; Archive Management; Preservation and access business case planning; Preservation project management tools; Metadata Discovery & Structuring; Public Access; Delivery and Exchange]

PrestoSpace Factory
[Diagram labels: Archive; Transactions: essence (master, lower quality), metadata (legacy, tech, enhanced); PrestoSpaceOrchestrator; original media; PrestoSpace Factory: Preservation Unit, Restoration Unit, Documentation Unit]

Documentation process
[Diagram labels: Legacy Metadata Import; Audiovisual Content Analysis; Semantic Analysis; Human Validation; Export/Publication]
• (Archive inventory)
• Legacy metadata import
• Automatic metadata extraction
  • AV content analysis
  • Semantic analysis on texts
  • Web mining
• Manual annotation and validation
• Export/publication

Getting preservation results
• The PSO (workflow manager) moves preservation results to documentation
  • An EDOB file containing identification and media association information
  • A digital master file
  • A quality/defect analysis report file
  • A preservation report file
• The master is transcoded to the lower quality formats required by the process
  • Windows Media 9 for viewing in publication
  • DVD quality MPEG2 for video content analysis
  • PCM soundtrack for ASR

Legacy metadata import
VSEM00285973 DOCUMENT= 218 OF 3056 PAGE = 1 OF 1
PROGRAMMA ** F137725 ** --PAG A 010 *-DATCLASS 19951031 --DG
TITOLI FATTI VOSTRI PIAZZA ITALIA DI SERA
SUPPORTO RVM 3/4 D2
DATIPROD --RETE TV2 --SEDE RM --GENERE 320900 --UORG 2250
--MATRICOLA 262582
DATITRAS *-DATRAS 19951027 --ORE 2025 --CANALE 2 *-DURTOT 022444
COLORE
AUTORI GUARDI MICHELE, FLORA GIOVANNA, ZAMPONI RORY, CIORCIOLINI MARCELLO. PRESENTA: MAGALLI GIANCARLO CON WINDHAM WENDY E I BARAONNA. A CURA DI MOLINARI LAURA.
REGIA GUARDI MICHELE
I0607 * End of document.

Documentation platform
[Diagram labels: EDOB; Rich Content Documentation Platform; MPEG7, PMETA, DC, MXF, JPG; Content Analysis; Shots-key frames; GAMPs; Content Analysis; Media Analysis; Semantic Analysis; Manual Annotation; Delivery; Core Platform web services; Content Analysis; Speech To Text; EMS Essence and Metadata Storage]

Technologies – storyboard
• Key frames list
• Stripe image

Technologies – feature extraction
• Camera motion
• Automatic Speech Recognition

Technologies – segmentation
• Several segmentation tools
  • Scene change detection
  • Clustering of similar scenes
  • Audio classifier (music, noise, speech)
  • Voice tracking
  • Lexical segmentation
  • Editorial parts merger

Technologies – segmentation

Technologies – semantics
• Classification
• Named Entities
• Source correlation

Extension to other genres
• The effectiveness of the automatic analysis tools varies according to the content genres and expected use
  • E.g., ASR not very useful on fiction
• The analysis process must be tailored accordingly
• The editorial parts segmenter must be adapted to reflect the editorial semantics (if possible)
• In some cases the process must rely mainly on manual annotation

Manual Annotation
• Functionality
  • validation and correction of automatic content analysis results (audiovisual and semantic)
  • content structuring
  • annotation on different structural levels of the content (programme, scene, shot, arbitrary temporal range)
• Integrated in documentation platform

Export
• The documentation results are exported to external systems or to the Publication Platform
• Export package includes:
  • Enriched EDOB
  • Key frames, stripe images
  • Video in browsing quality
  • Everything received from Preservation & Restoration
• Deleted from the Documentation Platform

Publication platform
[Diagram labels: Rich Content Web interface; Publication Platform; Key Frames View; Semantic Search (KIM Platform); Topic Search (Full text); http; Full motion Video preview; MCP Multimedia Contents Publisher; Speech to text display]

Publication Platform architecture
RETRIEVAL OF CONTENT
[Diagram labels: User Interface; Structured query; Restricted natural language query; CLIR Processor: Context disambiguation, Language translation; SQL Engine; Semantic Engine; Data Base; Domain Knowledge Base]

Retrieval of AV content
• Different types of retrieval are provided:
  • Legacy information (structured queries)
  • Full text
  • Ontology-driven browsing
  • Natural language queries
    • No constraint on the nature of the query (non-NL queries are also managed)
    • NERC
    • Cross-lingual
    • Query classification for domain-specific retrieval

KIM semantic engine
• KIM is a platform for semantic annotation, search, and analysis:
  • Framework for automatic semantic annotations
  • Storage of semantic annotations
  • Semantic indexing and search

Cross-language Retrieval (CLIR)
• Archive AV data span different languages
• Retrieval should be concept oriented rather than text oriented
• The source language (of queries) can be different from the target language (characterizing the metadata)

CLIR functionalities
• The implemented CLIR analyses the user query to extract
  • NEs, if any
  • Context categorization
  • Useful terms for FTS (removing
    stopwords)
• It then maps the extracted information into the target language
• The new query is substituted for the original one

CLIR translation example
Typed query: "Blair calls on NATO member to contribute more troops to Afghanistan force"
Translated query:
  Person: Blair
  Organization: Nato
  Location: Afghanistan
  Category: foreign affairs
  Text_en: blair, nato, member, troops, force, afghanistan
  Text_it: blair, nato, membro, truppe, arma, afghanistan

Considerations
• The relationship between descriptive metadata and essence is generally not one-to-one
• Descriptive metadata are more efficiently exchanged and managed if kept separate from the essence
• Mapping of different metadata schemas is unavoidable
• Lossless mapping is possible only if
  • Basic concepts are shared between models
  • Entities and attributes are well described and understood
• The mapping requires human skills
• The PrestoSpace data model and documentation process are described in Deliverable D15.2
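
The first consideration, that descriptive metadata and essence are not in a one-to-one relationship, can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the class names `EditorialEntity` and `EssenceCopy`, the locator strings, the low-res file name and the 25 fps non-drop-frame timecode are hypothetical and are not taken from the PrestoSpace data model. Only the tape 1234 timecode range comes from the archive duality slide.

```python
from dataclasses import dataclass, field
from typing import List

FPS = 25  # assumption: 25 fps, non-drop-frame timecode


def tc_to_frames(tc: str, fps: int = FPS) -> int:
    """Convert an HH:MM:SS:FF timecode into a frame count from 00:00:00:00."""
    hh, mm, ss, ff = (int(part) for part in tc.split(":"))
    if not (0 <= mm < 60 and 0 <= ss < 60 and 0 <= ff < fps):
        raise ValueError(f"invalid timecode: {tc}")
    return ((hh * 60 + mm) * 60 + ss) * fps + ff


@dataclass(frozen=True)
class EssenceCopy:
    """One essence version of an editorial entity (hypothetical structure)."""
    locator: str   # media locator, e.g. a tape number or a file name
    start_tc: str  # timecode where the copy starts on that medium
    end_tc: str    # timecode where the copy ends on that medium

    def duration_frames(self, fps: int = FPS) -> int:
        return tc_to_frames(self.end_tc, fps) - tc_to_frames(self.start_tc, fps)


@dataclass
class EditorialEntity:
    """One documentation record that points to many essence copies."""
    title: str
    copies: List[EssenceCopy] = field(default_factory=list)

    def same_duration(self) -> bool:
        """True when all copies last the same number of frames,
        a necessary condition for using one copy as a proxy for another."""
        return len({c.duration_frames() for c in self.copies}) <= 1


# Tape range from the archive duality slide; the low-res file is invented.
programme = EditorialEntity("abcd")
programme.copies.append(EssenceCopy("tape:1234", "00:01:00:00", "00:10:00:00"))
programme.copies.append(EssenceCopy("file:abcd_lowres.mp4", "00:00:00:00", "00:09:00:00"))

for copy in programme.copies:
    print(copy.locator, copy.duration_frames(), "frames")
```

Both copies last 13500 frames (9 minutes at the assumed 25 fps), so an offset measured in frames from the start of one copy identifies the same content in the other. Keeping the documentation record separate from the list of copies mirrors the point that the same documentation applies to several essence versions.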