Block-based Migration from HTML4 Standard to HTML5 Standard in the Context of Web Archives

Authors

  • Andrés Sanoja
  • Stéphane Gançarski

Keywords:

Migration, Web, Segmentation, Blocks, HTML5, Web Archive, Format Obsolescence

Abstract

Web archives are not exempt from format obsolescence. In the near future, Web pages written in HTML4 could become obsolete. We will then have to choose between two preservation strategies: emulation or migration. The first option is the most evident; however, given the size of the Web and the amount of information that Web archives handle, it is not practical. On the other hand, migration to HTML5 seems plausible. This is a challenge because we need to modify a page (in HTML4 format) to include elements that do not even exist in that format (such as the HTML5 semantic elements). Using Web page segmentation, we show that, with the appropriate granularity, blocks resemble these semantic elements. We present the use of our segmentation tool, BoM (Block-o-Matic), to help migrate Web pages from HTML4 to HTML5 in the context of Web archives. We also present an evaluation framework for Web page segmentation that helps produce the metrics needed to compare the original and migrated versions. If both versions are similar, the migration has been successful. We report the experiments and results obtained on a sample of 40 pages. We produced the manual segmentations for each page using our MoB tool. The results show that there is no data loss in the migration process, but in the migrated version (after adding the semantic elements) the margins change. That is, the migration adds whitespace that changes element positions, shifting elements slightly on the page. While this is imperceptible to the human eye, it is difficult for systems to handle without prior knowledge of this situation.
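
The last point, that a small whitespace shift is invisible to a reader but breaks an automated comparison, can be illustrated with a minimal sketch. It is not the evaluation framework described above: the `Block` type, the `match_ratio` function, the pixel values, and the tolerance are all hypothetical, chosen only to show why a comparison needs to know about the shift in advance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical segmented block: a label plus its rendered rectangle in pixels."""
    label: str
    x: int
    y: int
    width: int
    height: int


def same_block(a: Block, b: Block, tolerance: int = 0) -> bool:
    """True if both blocks share a label and their geometry differs by at most `tolerance` px."""
    return (
        a.label == b.label
        and abs(a.x - b.x) <= tolerance
        and abs(a.y - b.y) <= tolerance
        and abs(a.width - b.width) <= tolerance
        and abs(a.height - b.height) <= tolerance
    )


def match_ratio(original, migrated, tolerance=0):
    """Fraction of original blocks that have a matching block in the migrated page."""
    if not original:
        return 1.0
    matched = sum(
        any(same_block(o, m, tolerance) for m in migrated) for o in original
    )
    return matched / len(original)


# Illustrative data only: the migrated "article" block is shifted down by 8 px.
original = [Block("header", 0, 0, 800, 100), Block("article", 0, 100, 800, 500)]
migrated = [Block("header", 0, 0, 800, 100), Block("article", 0, 108, 800, 500)]

exact = match_ratio(original, migrated)                   # strict comparison
tolerant = match_ratio(original, migrated, tolerance=10)  # allows a small shift
```

With a strict comparison only the unshifted block matches, while a comparison that allows for a small shift matches both blocks.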

Section

Articles