Compressed indexable XML representation of astronomical data
Source of Funding: PPARC
e-Science presents computer scientists with new challenges in handling huge volumes of data. The student allocated to this project will work closely with people involved in the AstroGrid project. The project is concerned with the efficient storage and processing of large XML files that arise in the context of the International Virtual Observatory Alliance (IVOA). VOTable is an XML-based astronomical data format developed by the IVOA for tables and (later) images.
Unfortunately, XML-based files are larger than their binary equivalents (such as FITS), and network bandwidth will be a scarce resource for the Virtual Observatory. Different VOTable encodings allow trade-offs between efficiency and ease of parsing. Even within the XML community at large there is growing concern that inefficiency arising from document size will hinder adoption and use of XML. A few XML-specific approaches can compress XML files better than generic algorithms such as gzip. However, compression ratios can vary greatly (from 3:1 to 66:1) on different kinds of data. One issue, then, is to understand the characteristics of astronomical XML files and to invent or discover a compression method suited to these files.
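The size overhead described above can be illustrated with a small sketch. It is not part of the project software: the catalogue rows are invented, the XML layout is only a simplified imitation of a VOTable TABLEDATA section (not schema-valid), and Python's zlib module (DEFLATE, the algorithm underlying gzip) stands in for a generic compressor.

```python
import struct
import zlib

# Hypothetical catalogue: 1000 rows of (right ascension, declination, magnitude).
rows = [(i * 0.36, -90.0 + i * 0.18, 10.0 + (i % 50) * 0.1) for i in range(1000)]


def to_xml(rows):
    """Encode rows in a simplified, VOTable-like TABLEDATA layout (an assumption)."""
    lines = ["<TABLEDATA>"]
    for ra, dec, mag in rows:
        lines.append(
            f"<TR><TD>{ra:.6f}</TD><TD>{dec:.6f}</TD><TD>{mag:.3f}</TD></TR>"
        )
    lines.append("</TABLEDATA>")
    return "\n".join(lines).encode("utf-8")


def to_binary(rows):
    """Pack each row as three big-endian 64-bit floats (24 bytes per row)."""
    return b"".join(struct.pack(">3d", *row) for row in rows)


xml_bytes = to_xml(rows)
bin_bytes = to_binary(rows)
xml_packed = zlib.compress(xml_bytes, 9)  # generic DEFLATE compression

print("XML size (bytes):       ", len(xml_bytes))
print("Binary size (bytes):    ", len(bin_bytes))
print("Compressed XML (bytes): ", len(xml_packed))
print("Compression ratio:       %.1f:1" % (len(xml_bytes) / len(xml_packed)))
```

The ratio printed depends entirely on the synthetic data used here, so it says nothing about real astronomical files; measuring that variation on real data is exactly the first issue identified above.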
Although compressed files are much smaller, their contents become inaccessible until they are decompressed. Indeed, it would be impossible to support even the most rudimentary approaches to searching a compressed XML database, such as searching for sections that match an XPath expression. Thus, another issue is to develop a compressed file representation that supports sequential searching through the file for the necessary structural and semantic components.
As XML-based formats such as VOTable become the norm for the extraction of data from astronomical archives, XML is likely to follow FITS in being used not only for data interchange but also for data storage. XML-based databases will therefore assume increased importance. However, current XML technology is not efficient enough to scale well. A final issue is to develop a compressed in-memory representation that supports complex queries.
The software developed as part of this project is now being taken further in the SiXML project.
Author: Rajeev Raman (r.raman at mcs.le.ac.uk), T: +44 (0)116 252 3894.