RepoZip : a technique for lossless compression of document collections

dc.contributor.author: Sumanaweera, DN
dc.contributor.author: Doole, FF
dc.contributor.author: Pathiraja, DP
dc.contributor.author: Deshapriya, GGK
dc.contributor.author: Dias, G
dc.date.accessioned: 2018-09-19T21:08:16Z
dc.date.available: 2018-09-19T21:08:16Z
dc.description.abstract: Many computer systems, especially in corporations, contain large numbers of documents such as letters, reports and presentations. Many such documents exist in several versions. Such data needs to be synchronized with branch offices and mobile devices, often over slow and expensive connections. However, because many documents are stored in an already compressed format, it is difficult to compress them further by exploiting the hidden redundancies. We present a novel approach named RepoZip, which improves the compression achieved by an existing compression algorithm over a document collection by exploiting inter-document meta-data and content-level redundancies. It concentrates on compressing OOXML documents, which are constructed by archiving a hierarchy of meta-data files, and PDF documents, which include deflated content streams. The RepoZip approach thereby achieves larger compression gains over OOXML or PDF document collections by exploiting usually undetected meta-data-level similarities. [en_US]
dc.identifier.conference: Moratuwa Engineering Research Conference - MERCon 2015 [en_US]
dc.identifier.department: Department of Computer Science and Engineering [en_US]
dc.identifier.email: dinithi.10@cse.mrt.ac.lk [en_US]
dc.identifier.email: fahima.10@cse.mrt.ac.lk [en_US]
dc.identifier.email: daham.10@cse.mrt.ac.lk [en_US]
dc.identifier.email: kelum.10@cse.mrt.ac.lk [en_US]
dc.identifier.email: gihan@uom.lk [en_US]
dc.identifier.faculty: Engineering [en_US]
dc.identifier.place: Moratuwa, Sri Lanka [en_US]
dc.identifier.uri: http://dl.lib.mrt.ac.lk/handle/123/13577
dc.identifier.year: 2015 [en_US]
dc.language.iso: en [en_US]
dc.subject: lossless compression [en_US]
dc.subject: meta-data similarity
dc.subject: OOXML
dc.subject: clusters; generalized suffix tree
dc.title: RepoZip : a technique for lossless compression of document collections [en_US]
dc.type: Conference-Abstract [en_US]
