
Research and applications

HUGO: Hierarchical mUlti-reference Genome cOmpression for aligned reads

Pinghao Li,1 Xiaoqian Jiang,2 Shuang Wang,2 Jihoon Kim,2 Hongkai Xiong,1 Lucila Ohno-Machado2

▸ Additional material is published online only. To view please visit the journal online (http://dx.doi.org/10.1136/amiajnl-2013-002147).

1EE Department, Shanghai Jiaotong University, Shanghai, China
2Division of Biomedical Informatics, University of California–San Diego, La Jolla, California, USA

Correspondence to Pinghao Li, EE Department, Shanghai Jiaotong University, SEIEE Building 5-516, 800 Dongchuan RD, Minhang District, Shanghai 200240, China; [emailprotected]

Received 1 July 2013
Revised 27 November 2013
Accepted 3 December 2013
Published Online First 24 December 2013


To cite: Li P, Jiang X, Wang S, et al. J Am Med Inform Assoc 2014;21: 363–373.

ABSTRACT

Background and objective Short-read sequencing is becoming the standard of practice for the study of structural variants associated with disease. However, with the growth of sequence data largely surpassing reasonable storage capability, the biomedical community is challenged with the management, transfer, archiving, and storage of sequence data.

Methods We developed Hierarchical mUlti-reference Genome cOmpression (HUGO), a novel compression algorithm for aligned reads in the sorted Sequence Alignment/Map (SAM) format. We first aligned short reads against a reference genome and stored exactly mapped reads for compression. For the inexact mapped or unmapped reads, we realigned them against different reference genomes using an adaptive scheme that gradually shortens the read length. For the base quality values, we offer lossy and lossless compression mechanisms. The lossy mechanism uses k-means clustering, where a user can adjust the balance between decompression quality and compression rate; lossless compression is obtained by setting k (the number of clusters) to the number of distinct quality values.

Results The proposed method produced a compression ratio in the range 0.5–0.65, which corresponds to 35–50% storage savings on the experimental datasets. The proposed approach achieved 15% more storage savings than CRAM and a compression ratio comparable to that of Samcomp (CRAM and Samcomp are two of the state-of-the-art genome compression algorithms). The software is freely available at https://sourceforge.net/projects/hierachicaldnac/ under a General Public License (GPL) license.

Limitation Our method requires multiple reference genomes and prolongs the execution time because of the additional alignments.

Conclusions The proposed multi-reference-based compression algorithm for aligned reads outperforms existing single-reference-based algorithms.

BACKGROUND AND SIGNIFICANCE

Massively parallel sequencing is leading revolutionary advances in biology and helping unravel the genetic basis of disease.1 With the advent of next generation sequencing (NGS) technologies supported by companies such as Helicos Biosciences, Pacific Biosciences, and Illumina, the cost of sequencing a whole genome has decreased dramatically in the past few years, and it is expected to drop below $1000 per individual in the near future. However, the massive amount of data produced by whole genome sequencing is becoming a large burden for data storage and transmission. For example, the DNA of an individual comprises two complementary copies of more than 3.2 billion bases. Today, the 1000 Genomes Project2 has produced more than 50 TB of data for 1092 individuals from 14 populations, and is quickly moving toward the goal of sequencing 2500 people.3 Many other genome data repositories are also expanding quickly; for example, the size of the Sequence Read Archive (SRA), a public repository of sequencing data, will exceed 1000 TB by the end of 2013.4 Although storage capacity keeps increasing, it cannot catch up with the speed of sequencing data generation. Thus, advanced genome compression mechanisms need to be developed to reduce the storage requirements and the bandwidth necessary for transferring genome data.

Sequence alignment is the first and essential step in genome analysis. Alignment tools like BLAST,5 CUDASW++,6 or Bowtie,7 8 9 take advantage of the fact that the nucleotide diversity within the same species is relatively small; the difference in humans is around 0.1%, that is, 1 base difference per 1000.10 Two formats, FASTQ and Sequence Alignment/Map (SAM), are becoming the industry standards for NGS data. FASTQ, which is a common input to alignment tools, is a text-based format for storing a biological sequence and its corresponding quality scores. The SAM file, which is an output of the alignment tool, follows a TAB-delimited text format consisting of an optional header section and an alignment section. Each alignment line has a variable number of optional fields carrying aligner-specific information, and 11 mandatory fields with essential alignment information, such as the mapping position. A SAM file is commonly compressed in the Blocked GNU Zip Format (BGZF) and converted to a smaller binary BAM file,11 which is supported by Illumina GA/HiSeq12 and Roche 454.13 In this study, our core compression algorithm focuses on the SAM format.

Related work

There are several analysis toolkits that deal with NGS data, such as Goby14 and GATK15 16 for alignment data. Sakib et al17 presented SAMZIP, an encoding scheme specific to SAM files that exploits knowledge of the SAM format and its specifications; it processes each field independently using encoding techniques such as run-length encoding (RLE), delta encoding, Huffman coding, and dictionary coding. The SLIMGENE algorithm, implemented by Kozanitis et al in collaboration with iDASH (a national center for biomedical computing),18 can achieve a 40× compression ratio (CR) on genomic fragments without quality values and a 5× CR when quality values are included.


Hach et al19 introduced another compression approach, SCALCE, for HTS (high throughput sequencing) genome (or transcriptome, exome, etc) sequences. SCALCE is based on the reorganization of reads to boost the locality of reference. The aforementioned algorithms compress genome data by exploiting the redundant structure of the target; however, they do not take full advantage of publicly available genome information. Recently, Fritz et al20 presented a reference-based compression method called CRAM. CRAM demonstrated a high CR by enabling lossy compression of the quality information and unaligned sequences, and it remaps unmapped reads (UMRs) using a secondary compression framework that realigns the unaligned reads against other reference sequences. Regarding lossless compression algorithms for FASTQ and SAM/BAM formats, Quip, proposed by Jones et al,21 is one of the first assembly-based compressors using a de novo assembly algorithm. Later on, Popitsch et al22 presented another set of lossless and lossy compressors called NGC. The NGC algorithm improves compression efficiency by exploiting the redundancy among reads that are mapped to the same genomic positions, followed by a highly configurable strategy for quantizing per-base quality values. Samcomp, introduced by Bonfield et al,23 uses the SAM flags, position, and cigar strings to anchor each called base to a reference coordinate and encodes the base according to a per-coordinate model; its accuracy and compression performance rely on the availability of aligned data at each specific reference coordinate.

The algorithms listed above either rely on a single reference genome or generate a de novo assembly that is entirely self-contained to align the target sequence for compression. However, such strategies usually result in a low rate of exact mapped reads (EMRs), and the compression of inexact mapped reads (IMRs) (ie, reads with four or fewer mismatches) and UMRs (ie, reads with more than four mismatches) usually requires more bits than the compression of EMRs. In contrast, we aim at improving compression efficiency by increasing the rate of EMRs through the use of alternative references. Our strategy relies on sequentially mapping reads against multiple reference genomes as well as on an adaptive read length shortening mechanism.

Our proposed algorithm, Hierarchical mUlti-reference Genome cOmpression (HUGO), compresses a SAM format file sorted by position within the reference by extracting the 11 mandatory fields and the variable number of optional fields into 12 separate files. Because all these fields have low inter-correlations,17 we can process each field in parallel. The workflow of the proposed algorithm is shown in figure 1. Since the 'Sequence' and the 'Quality value' fields represent more than 50% of the total data and are usually hard to compress (owing to their lack of regularity), we concentrate on the compression of these two difficult fields. For the 'Sequence' field, we align the reads against the reference sequence using SOAP3,24 a Binary Alignment/Map (BAM)- and Graphic Processing Unit (GPU)-based aligning software, and record the alignment information for the EMRs for compression. SOAP324 can be 20× faster than CPU-based tools like BWA or Bowtie; since HUGO is designed to conduct several rounds of alignments, speed is a critical concern.
We realign the IMRs and UMRs against different reference sequences, which, combined with an adaptive read length shortening mechanism, turns these reads into EMRs. For the 'Quality' field, we further propose a lossy quantization approach using the k-means clustering algorithm and investigate its impact on downstream applications.

For the remaining fields in the SAM file, we explore their self-regularity and inter-field relationships, and then adopt appropriate compression algorithms for each of them.

MATERIALS AND METHODS

Datasets

The datasets were taken from several types of NGS data. Sequences whose names start with 'NA' or 'HG' were taken from the 1000 Genomes Project. The remaining sequences were from the SRA, under the following study numbers/run accession numbers: ChIP-Seq (mouse): SRX014899/SRR032209; RNA-Seq (Escherichia coli): ERX007969/ERR019653. Additionally, we used the HCC1954 genome from the University of California, Santa Cruz mixed spike-in low coverage sample for the Cancer Genome Atlas Benchmark 4, found at https://cghub.ucsc.edu/datasets/benchmark_download.html.

Methods

The SAM format has one header section, with approximately 100 lines in total, and one alignment section. The header section can be identified by the prefix character '@'. Each alignment record in the alignment section consists of 11 mandatory fields and a variable number of optional fields. As mentioned earlier, since the inter-field correlations are very low,17 we can process each field independently and in parallel, which reduces compression time. Since this paper mainly focuses on the compression of the 'Sequence' and 'Quality value' fields, a review of classic encoding schemes for the remaining fields in the SAM format is given in online supplementary appendix 1.
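To make the field-wise processing concrete, the sketch below splits a sorted SAM file into one stream per mandatory field plus one stream holding all optional fields, which could then be compressed independently (and in parallel). This is a minimal illustration of the layout described above, not the HUGO implementation; the output file naming and the plain-text streams are assumptions made for the example.

```python
# Minimal sketch: split a SAM file into 12 per-field streams
# (11 mandatory fields + one stream holding all optional fields).
MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "MRNM", "MPOS", "TLEN", "SEQ", "QUAL"]

def split_sam_fields(sam_path, out_prefix):
    outs = {name: open(f"{out_prefix}.{name}.txt", "w")
            for name in MANDATORY + ["OPT"]}
    with open(sam_path) as sam:
        for line in sam:
            if line.startswith("@"):          # header section, handled separately
                continue
            cols = line.rstrip("\n").split("\t")
            for name, value in zip(MANDATORY, cols[:11]):
                outs[name].write(value + "\n")
            # columns 12 and beyond are optional, aligner-specific fields
            outs["OPT"].write("\t".join(cols[11:]) + "\n")
    for f in outs.values():
        f.close()
```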

Sequence field encoding design

Our method focuses on the efficient compression of genome sequences by seeking the best match among multiple reference genomes. The rationale for using a reference genome to improve compression performance is that the nucleotide diversity within the human species is relatively small, so most reads in a re-sequencing run can find EMRs or IMRs within the reference sequences. Although part of the differences between the reads and the reference genome stems from sequencing errors rather than genetic variation, the ratios among EMRs, IMRs, and UMRs for a given set of input sequences usually vary considerably across the references used in the alignment, since different references usually reflect different sample characteristics, such as the ethnicity of the sample donor. Figure 2 shows an example of the alignment results for NA12878chrom2025 against four different reference sequences: hg18, hg19, HuRef chromosome 20, and HuRef (see online supplementary appendix 4 for details of these references). Given the short reads from the 1000 Genomes sample NA12878,25 the reference genome hg19 shows the highest rate of EMRs compared to the other three references, owing to its completeness and integrity. However, more than 25% of reads remain partially mapped or unmapped. Existing genome compression methods handle IMRs and UMRs either (a) in their original format (eg, NGC), by storing UMRs in BAM format while quantizing their quality values in lossy compression modes, or (b) in a lossless manner (eg, SLIMGENE and CRAM), by introducing extra bits to record the mismatches. Original storage of IMRs and UMRs in BAM format usually achieves a low compression rate, and lossy quantization of quality values is not always favored by researchers who require lossless recovery of quality values for their downstream applications.


Figure 1 The framework for the proposed genome compression algorithm. The input file first goes through the exact matching scheme (leftmost branch) to iteratively identify matched sequences with reduced length. Unmatched sequences below a certain length threshold are compressed by PPMVC (command line parameter 'e'), a variant of the prediction by partial matching (PPM) algorithm27 based on context modeling and prediction. The remaining fields, except for the quality value, are handled by regularity-based coding (middle branch). We support two versions of quality value compression: (1) a lossy version using k-means clustering and multi-thread PPMVC (command line parameter 'e -o4 -r1'), and (2) a lossless version with multi-thread PPMVC (command line parameter 'e -o4 -r1') only (rightmost branch).

Although the existing lossless compression algorithms for IMRs and UMRs preserve the best data utility, they usually require a large number of extra bits and therefore deteriorate the compression performance. For example, in order to encode IMRs/UMRs, information about their positions, strands (forward mapping or inverse mapping), substitutions, insertions, and deletions needs to be recorded. In fact, such an approach may not be the most efficient way to handle IMRs/UMRs, since a long IMR/UMR can often be represented by multiple shorter EMRs in the same reference, which require fewer bits and usually result in better compression efficiency. Figure 2 illustrates that different references can provide different alignment coverage for a given set of input sequences. In other words, an IMR/UMR that cannot find an exact match in one reference might be perfectly aligned with another reference, or broken down into smaller sequences that are EMRs.
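As a concrete illustration of the three read classes used in figure 2, the sketch below bins aligned reads by their edit distance to the reference, here taken from the aligner-reported NM tag when it is available; treating a missing alignment or missing NM tag as unmapped is an assumption made only for this example.

```python
# Sketch: classify reads as EMR (0 mismatches), IMR (1-4), or UMR (>4 / unaligned),
# mirroring the categories plotted in figure 2.
def classify(read_is_mapped, nm_tag):
    """nm_tag: edit distance reported by the aligner (NM), or None if absent."""
    if not read_is_mapped or nm_tag is None:
        return "UMR"
    if nm_tag == 0:
        return "EMR"
    return "IMR" if nm_tag <= 4 else "UMR"

counts = {"EMR": 0, "IMR": 0, "UMR": 0}
for mapped, nm in [(True, 0), (True, 2), (True, 7), (False, None)]:
    counts[classify(mapped, nm)] += 1
print(counts)   # {'EMR': 1, 'IMR': 1, 'UMR': 2}
```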

Figure 2 Alignment results of NA12878chrom20 against four references: hg18 (upper left), hg19 (upper right), HuRef chromosome 20 (lower left), and HuRef (lower right). Each pie chart depicts the percentage of exact mapped reads (EMRs) with 0 mismatches, inexact mapped reads (IMRs) with 1 to 4 mismatches, and unmapped reads (UMRs) with more than 4 mismatches.


Based on the above observations, we propose the HUGO framework, which incorporates mapping to alternative references as well as an adaptive read length shortening scheme to improve the compression performance for IMRs/UMRs. Figure 3 shows an example of aligning the input file NA12878chrom20 against two references, hg19 and HuRef, using the read length shortening scheme, where IMRs or UMRs generated from hg19 are first shortened by half and realigned against hg19 or HuRef. The HUGO encoder then keeps reducing the lengths of all newly generated IMRs/UMRs and realigning the shorter reads until a predetermined threshold is reached. We can see that the proportions of IMRs and UMRs decrease gradually as the read length gets shorter. In particular, the percentage of UMRs drops to 0 when the read length is shortened to 19, as shown in figure 3.

Although figure 3 illustrates the advantages of multi-reference alignment and of the length shortening scheme for reducing the rate of IMRs and UMRs, it still shows each reference being considered independently. In figure 4, we explain how to build a hierarchical framework that interactively selects the best reference, in this case using the HG00096chrom11 sequence25 against two references, hg19 and HuRef. For example, in the second iteration, since the percentage of EMRs aligned by HuRef.fa (ie, 14.57%) is higher than that of hg19.fa (ie, 12.5%), the HUGO codec picks HuRef.fa as the best reference for read length L = 50. Following the same procedure, figure 4 shows that a 99.99% EMR rate, which is the sum over all the percentages in green, is achieved within four iterations. If two candidate references lead to the same EMR rate, as shown in iteration 4 in figure 4, the reference that requires less information to represent the EMRs is selected. In practice, the EMR rate in the HUGO framework usually increases when references that are biologically more similar to the sample are used.
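The following sketch captures the control flow of this hierarchical scheme: at each iteration the remaining IMRs/UMRs are halved, realigned against every candidate reference, and the reference with the highest EMR rate is kept; iteration stops at a minimum read length or after a fixed number of rounds. The caller-supplied `align` function standing in for a SOAP3 run is hypothetical, so this is a structural outline rather than the actual HUGO codec.

```python
# Structural sketch of the hierarchical multi-reference alignment loop.
# `align(reads, reference)` is a caller-supplied stand-in for an aligner run
# (eg, SOAP3) and is assumed to return (exact_hits, remaining_reads).
def hugo_alignment_rounds(reads, references, align, min_len=17, max_rounds=3):
    stored_hits = []                          # EMR records kept for compression
    for round_no in range(max_rounds):
        if not reads or len(reads[0]) < min_len:
            break
        best = None
        for ref in references:                # try every candidate reference
            exact, remaining = align(reads, ref)
            emr_rate = len(exact) / len(reads)
            if best is None or emr_rate > best[0]:
                best = (emr_rate, ref, exact, remaining)
        _, ref, exact, remaining = best       # keep the reference with the most EMRs
        stored_hits.append((round_no, ref, exact))
        # halve the still inexact/unmapped reads before the next round
        reads = [half for r in remaining
                 for half in (r[:len(r) // 2], r[len(r) // 2:])]
    return stored_hits, reads                 # leftovers go to the PPM-style coder
```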

In SOAP3, each EMR is represented by a very efficient data structure, a tetrad of {Read ID, Chrom ID, Offset, Strand}, where each field of the tetrad is encoded as follows (a small sketch of these encodings follows the list):

Read ID: This field is the read identifier of an EMR in the original sequence. Since the read ID of an EMR is usually adjacent to that of another EMR (see online supplementary figure S3 in appendix 3), we only need to record the differences between the read IDs of adjacent EMRs through Delta coding. Huffman coding is then used to encode these differences; empirically, most differences are equal to or less than 2.

Chromosome ID: This field is the identifier of the chromosome to which an EMR has been aligned. The vast majority of EMRs within the same sequence region are aligned to the same chromosome, so run-length encoding (RLE) is used to compress these identifiers. For example, the input 'wwwwddddd' is encoded as 'w4d5' in RLE.

Offset: This field indicates the leftmost position at which the exact alignment occurs in the designated chromosome of the reference sequence. Our experiments indicate that the offset positions of adjacent EMRs are close to each other (see online supplementary figure S3 in appendix 3). As in the Read ID case, we apply Delta coding and Huffman coding to compute and encode these differences.

Strand: This field consists of two identifiers, '+' and '−', which represent a forward mapped read and a reverse complementary mapped read, respectively. RLE is selected for its compression.
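A compact sketch of this tetrad encoding: read IDs and offsets are delta coded (the small differences would then be entropy coded, eg, with Huffman coding, which is omitted here for brevity), while chromosome IDs and strands are run-length encoded. The example values are hypothetical.

```python
# Sketch of the per-field encodings for the {Read ID, Chrom ID, Offset, Strand} tetrad.
def delta_encode(values):
    # store the first value and the differences between neighbours;
    # the small differences then suit Huffman coding
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def run_length_encode(symbols):
    # 'wwwwddddd' -> [('w', 4), ('d', 5)]
    out = []
    for s in symbols:
        if out and out[-1][0] == s:
            out[-1][1] += 1
        else:
            out.append([s, 1])
    return [(s, n) for s, n in out]

read_ids = [1001, 1002, 1004, 1005]           # hypothetical adjacent EMR read IDs
offsets = [60050, 60062, 60101, 60140]        # hypothetical leftmost positions
print(delta_encode(read_ids))                 # [1001, 1, 2, 1]
print(delta_encode(offsets))                  # [60050, 12, 39, 39]
print(run_length_encode("wwwwddddd"))         # [('w', 4), ('d', 5)]
print(run_length_encode("++++-"))             # [('+', 4), ('-', 1)]
```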

Figure 3 An example of aligning the input file NA12878chrom20 against two references, hg19 and HuRef. Exact mapped reads (EMRs)/IMRs and unmapped reads (UMRs) generated from hg19 are shortened and passed through the realignment module against hg19 and HuRef. Since each read length shortening step will double the number of IMRs/UMRs, we also double the y-axis limits to maintain an effective visual scale.


Figure 4 The procedure of aligning the input sequence HG00096chrom11 with references hg19 and HuRef. Here, a 99.99% exact mapped reads (EMRs) rate, which is the sum over all the percentages in green, is achieved in 4 iterations. The sequence length starts at L = 100 and the lengths of IMRs and unmapped reads (UMRs) are shortened by half after each iteration until L = 12 or 13. At each iteration, the reference that leads to a lower IMRs/UMRs rate (shown in red) is selected as the current reference.

Since the current release of 1000 Genomes data is generated by Illumina sequencers, whose read lengths usually range from 70 to 120 bp, the final read lengths after three read-length-shortening iterations range from 17 to 30 bp. As observed in our experiments, further splits beyond three iterations yield only a very small compression gain but a large increase in compression time. Thus, a maximum of three iterations is used in our codec. IMRs and UMRs remaining after the last iteration are encoded with the PPM (prediction by partial matching) algorithm.26 27 The details of PPM can be found in online supplementary appendix 1.

It is worth noting that Fritz et al have proposed to remap UMRs in CRAM using a secondary compression framework that may be built from multiple datasets, including an alternative human sequence. However, our HUGO approach differs from CRAM in three main aspects:
1. Unlike our hierarchical structure, CRAM only remaps the unaligned reads once, to either contiguous sequences that are present across the first mapping or to another human sequence (eg, HuRef), and only later tries to map these reads to all bacterial and viral sequences to account for potential laboratory or sample contamination.

2. Because it does not employ the length-shortening procedure of our approach, CRAM's remapping operation cannot map as many unaligned reads as our proposed method.
3. CRAM has not implemented its multi-reference slices even in the latest V.2.0, and the efficiency of this proposed special framework has not been tested. Hence, the multi-reference idea does not seem to have been put into practice yet.

Quality value field encoding design

The proposed strategy for the quantization of quality values seems related to NGC's quantization 'binning' scheme. However, NGC determines the quantization borders by mapping all quality values that lie within an interval to a single value within this interval, and by allowing different quantization schemas to be applied to quality values of different categories. In contrast, the proposed method automatically determines the quantization borders through k-means clustering, minimizing the quantization error of all quality values in terms of the within-cluster sum of squares (WCSS) of the errors.

For sequencing data from an Illumina sequencer, the original Phred quality values (Qphred) range from 0 to 93, most of which correspond to invisible ASCII characters. A shifted ASCII quality value QASCII from 33 to 126 is therefore introduced in equation (1), where Pe represents the error probability of the corresponding base-call. In practice, QASCII has a wide value range and follows a quasi-random distribution, which results in high entropy and makes it hard to compress the 'Quality value' field efficiently.17 21 28 29


Therefore, the best practice in most existing genome compression schemes20 22 28 is to compress QASCII in a lossy manner, which provides a trade-off between compression performance and data utility.

$$Q_{\mathrm{ASCII}} = Q_{\mathrm{phred}} + 33;\qquad P_e = 10^{-\frac{Q_{\mathrm{ASCII}}-33}{10}};\qquad Q_{\mathrm{phred}} = -10\log_{10} P_e \qquad (1)$$
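For reference, equation (1) amounts to the small conversion below between Phred scores, shifted ASCII quality values, and base-call error probabilities; this is the standard relationship and matches the definitions used throughout this section.

```python
import math

def phred_to_ascii(q_phred):          # Q_ASCII = Q_phred + 33
    return q_phred + 33

def ascii_to_error_prob(q_ascii):     # P_e = 10^(-(Q_ASCII - 33)/10)
    return 10 ** (-(q_ascii - 33) / 10)

def error_prob_to_phred(p_e):         # Q_phred = -10 * log10(P_e)
    return -10 * math.log10(p_e)

assert phred_to_ascii(30) == 63
assert abs(ascii_to_error_prob(63) - 0.001) < 1e-12
assert abs(error_prob_to_phred(0.001) - 30) < 1e-9
```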

However, the most important question for such a method is whether the information loss due to lossy compression of quality values has a large impact on the results of downstream applications, such as software for variant calling,16 genotype calling, and removal of potential PCR duplicates,30 which treat the quality values as an authoritative input. To allow a flexible adjustment between compression efficiency and data utility for downstream applications, we present a user-configurable quantization scheme for the lossy compression of Q-values through k-means clustering. Assume an individual has a total of $L_r = \sum_{t=1}^{W} m_t$ quality values, where read t has length $m_t$ and the total number of reads is W. The whole set of $L_r$ quality values reduces to n unique values $\{q_1, q_2, \ldots, q_n\}$, where each $q_j$ appears $num_j$ times in the W reads. The k-means clustering aims at partitioning the $L_r$ quality values into k clusters, so that the quality values within the same cluster can be replaced by the quality value of the cluster center.31 We minimize the WCSS in terms of the error probability:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{q_j \in S_i} num_j \left(10^{-\frac{q_j-33}{10}} - 10^{-\frac{u_i-33}{10}}\right)^2 \qquad (2)$$

where $u_i$ is the cluster mean of the points in $S_i$ and the exponential terms come from the definition in equation (1).

Box 1 Algorithm 1: Proposed k-means clustering scheme for Q-values' quantization and compression

1. Initiate the set of k means $U^{(1)} = \{u_1^{(1)}, u_2^{(1)}, \ldots, u_k^{(1)}\}$.
2. while the set of k cluster centers $U^{(t)} \neq U^{(t-1)}$ do
3. Assignment step: assign each observation to the cluster whose mean is closest.
4. $S_i^{(t)} = \{q_p : |10^{-\frac{q_p-33}{10}} - 10^{-\frac{u_i^{(t)}-33}{10}}| \le |10^{-\frac{q_p-33}{10}} - 10^{-\frac{u_j^{(t)}-33}{10}}|,\ \forall\, 1 \le j \le k\}$
5. Update center step: calculate the new means as the centroids of the observations in the new clusters.
6. $u_i^{(t+1)} = -10\log_{10}\!\left(\frac{1}{|S_i^{(t)}|}\sum_{q_j \in S_i^{(t)}} 10^{-\frac{q_j-33}{10}}\, num_j\right) + 33$
7. end while
8. Calculate the mean absolute percentage error (ie, MAPE):
9. $\mathrm{MAPE} = \dfrac{100}{\sum_i num_i}\displaystyle\sum_{i=1}^{k}\sum_{q_j \in S_i} num_j \dfrac{\left|10^{-\frac{q_j-33}{10}} - 10^{-\frac{u_i-33}{10}}\right|}{10^{-\frac{q_j-33}{10}}}$

Algorithm 1 in box 1 summarizes the proposed k-means clustering scheme for Q-value quantization and compression. Based on algorithm 1, the quantization result after k-means clustering is shown in figure 5, where k = 10 and the input file is NA12878chrom20. In figure 5, the value of each cluster center is shown in red and the qj's within the same cluster are depicted in the same color. Online supplementary table S1 in appendix 3 illustrates the trade-off between the compression performance using bzip2 and the quantization error, in terms of mean absolute percentage error (MAPE), after k-means clustering of the quality values. Although the scheme decreases storage by more than 50% with k = 5, the MAPE increases to 27.49%. This result indicates that the proposed quantization method may introduce too much noise for small k. A more important efficacy measurement is the impact of the lossy scheme on downstream applications, especially variant calling; we test this impact in the following section. Finally, it is worth mentioning that the proposed algorithm is equivalent to a lossless compression scheme without quantization when k is set to the number of unique quality values, which is 51 in this case.
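The sketch below mirrors Algorithm 1: quality values are clustered by the error probabilities they encode, each value is replaced by its cluster center, and the MAPE of the substitution is reported. It is a plain Lloyd-style implementation written for clarity, not the tuned HUGO code; the initialization with evenly spread centers is an assumption.

```python
import math

def p_err(q):                    # error probability encoded by an ASCII quality value
    return 10 ** (-(q - 33) / 10)

def quantize_qualities(quals, k, iters=50):
    """Cluster ASCII quality values in error-probability space (Algorithm 1 sketch)."""
    uniq = sorted(set(quals))
    # spread the initial centers evenly over the observed unique values (assumption)
    centres = [uniq[int(i * (len(uniq) - 1) / max(k - 1, 1))] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for q in quals:          # assignment step: nearest center in P_e space
            i = min(range(k), key=lambda j: abs(p_err(q) - p_err(centres[j])))
            clusters[i].append(q)
        new_centres = [
            -10 * math.log10(sum(p_err(q) for q in c) / len(c)) + 33 if c else centres[i]
            for i, c in enumerate(clusters)
        ]
        if all(abs(a - b) < 1e-9 for a, b in zip(new_centres, centres)):
            break                # centers stopped moving
        centres = new_centres
    lookup = {q: min(centres, key=lambda u: abs(p_err(q) - p_err(u))) for q in uniq}
    mape = 100 * sum(abs(p_err(q) - p_err(lookup[q])) / p_err(q) for q in quals) / len(quals)
    return lookup, mape

# usage with a small synthetic list of ASCII quality values
lookup, mape = quantize_qualities([35, 36, 40, 41, 60, 61, 62, 70], k=3)
print(lookup, round(mape, 2))
```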

The influence of quantizing per-base quality values on downstream analysis

We used a popular variant calling tool, SAMTOOLS, for testing variant and genotype prediction. The variant sets obtained from the file decompressed from the lossy compressed BAM file were compared with those obtained from the uncompressed datasets, which served as our 'gold standard'. If the variant sets obtained from these two inputs show very similar results, we can conclude that the lossy quantization of quality values has little influence on the downstream analysis. We counted the number of recovered (ie, tp: true positive), lost (ie, fn: false negative), and additional (ie, fp: false positive) variants, and of recovered invariants (ie, tn: true negative). Then, we defined the variant recovery precision as tp/(tp + fp), the sensitivity (a.k.a. recall rate) as tp/(tp + fn), the genotype preservation percentage as 1 − cgt/tp,22 and the specificity as tn/(tn + fp), following the specifications, where cgt is the number of variants from the set of true positives that changed their genotype classification from homozygous to heterozygous or vice versa. The variant recovery precision and sensitivity analysis reveal how much the k-means clustering quantization sacrifices in terms of false positive and false negative calls. The genotype preservation percentage indicates how well each quantization mode preserves the predicted genotype of called variants. The variant recovery specificity measures the proportion of negatives (ie, invariants) that are correctly identified.

In figure 6, we plot these four statistical measures for NA12878chrom20 and HG00096chrom11, respectively; the corresponding numbers are given in online supplementary table S2 in appendix 3. Figure 6 shows that, as expected, with lower k (ie, stronger compression) one observes more fp + fn as well as fewer tp + tn and more incorrectly predicted genotypes. However, this does not mean that tp (or tn) with a lower k is always smaller, or that fp (or fn) with a lower k is always larger, than with a higher k, as can also be inferred from online supplementary table S2. That is why HUGO_L with six clusters (ie, HUGO_L(6)) performs worse than the settings with one and two clusters in terms of variant recovery precision and variant recovery specificity.
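The four measures are simple ratios over the confusion counts; a minimal helper is shown below, with the counts themselves assumed to come from comparing the two variant call sets (the numbers in the usage line are hypothetical).

```python
# Sketch: the four evaluation measures used in figure 6, computed from
# confusion counts (tp, fp, fn, tn) and cgt (true positives whose genotype changed).
def variant_recovery_metrics(tp, fp, fn, tn, cgt):
    return {
        "precision":             tp / (tp + fp),
        "sensitivity":           tp / (tp + fn),
        "specificity":           tn / (tn + fp),
        "genotype_preservation": 1 - cgt / tp,
    }

# hypothetical counts, for illustration only
print(variant_recovery_metrics(tp=9500, fp=120, fn=300, tn=88000, cgt=45))
```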


Figure 5 The distribution of k-means clustering results with k = 10. The vertical axis shows the number of occurrences of each quality value on a log scale. Since n is 51 in this sample, there are 51 unique quality values in the original 'Quality value' field, and we quantize these integers into 10 clusters (red bars) using the k-means clustering scheme. Quality values in the same cluster are depicted in the same color.

In any case, figure 6 shows that when k ≥ 10 the proposed algorithm yields reasonable accuracy (ie, all four statistical measures are above 95%). HG00096chrom11 gives more accurate results than NA12878chrom20 for the same lossy mode k. Moreover, the proposed lossy quality value quantization schemes outperform the lossy mode of CRAM (ie, CRAM-l), with higher variant recovery, higher variant recovery specificity, and higher genotype preservation rates. However, HUGO_L(1) and HUGO_L(2) show lower variant recovery sensitivity (ie, more false negatives) than CRAM-l. Clearly, there is no single winner, but our experiments empirically suggest that storage can be reduced without sacrificing much data utility by selecting an appropriate cluster number k.

Figure 6 The impact of lossy compression on variant calling based on (left) NA12878chrom20 and (right) HG00096chrom11 sequences using SAMTOOLS. Here 'Hierarchical mUlti-reference Genome cOmpression (HUGO)' refers to the lossless compression of quality values. 'HUGO_L' refers to the lossy compression of quality values using the k-means algorithm, and the number in brackets indicates the value of k. 'CRAM-l' is the lossy mode of CRAM with the lossy compression command line '--capture-insertion-quality-scores --capture-piled-quality-scores --capture-substitution-quality-scores --capture-unmapped-quality-scores --capture-all-tags'.


Encoding techniques for fields other than 'Sequence' and 'Quality value' fields

The distribution regularity of each field and the corresponding compression methods are described in online supplementary appendix 2, and the specific compression scheme for each field is summarized in table 1. We tested the compression performance for the corresponding 10 fields of the NA12878chrom20 and HG00096chrom11 sequences. The original size of each individual SAM field is the file size of the uncompressed field extracted from the corresponding SAM file. As shown in table 1, the compressed files are much smaller than the original ones.

At decompression, we first independently decompress and transform the 12 fields, which consist of the 11 mandatory fields and all optional SAM fields. We then reconstruct the SAM file from the 12 separate decompressed files, each of which contains the information of one field, except that the twelfth file contains the information of all optional fields. For the 'Sequence' field, because the IMRs/UMRs of each alignment round are passed to the next round during compression, we have to recover reads from the last alignment back to the first during decompression, decompressing the IMRs/UMRs at each round; consequently, the decompressed reads are identical to the original ones. Since each field's content preserves the record order, we concatenate the 12 fragments from the 12 files record by record in SAM's standard field order (ie, 'QNAME', 'FLAG', 'RNAME', 'POS', 'MAPQ', 'CIGAR', 'MRNM', 'MPOS', 'TLEN', 'SEQ', 'QUAL', 'OPT'). The names of the candidate references are recorded directly from the user's command line settings. If needed, we convert the generated SAM file to the BAM format using SAMTOOLS.
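Decompression reverses the split performed at encoding time: the 12 per-field streams are read in lockstep and re-joined record by record in the standard SAM column order. The sketch below illustrates only this reassembly step; the file naming matches the hypothetical splitter shown earlier and is an assumption.

```python
# Sketch: rebuild SAM alignment lines by zipping the 12 decompressed field streams
# back together in the standard column order.
FIELD_ORDER = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "MRNM", "MPOS", "TLEN", "SEQ", "QUAL", "OPT"]

def reassemble_sam(out_prefix, sam_out_path, header_lines=()):
    streams = [open(f"{out_prefix}.{name}.txt") for name in FIELD_ORDER]
    with open(sam_out_path, "w") as out:
        out.writelines(header_lines)             # '@'-prefixed header section
        for record in zip(*streams):             # one value per field, per record
            out.write("\t".join(v.rstrip("\n") for v in record) + "\n")
    for s in streams:
        s.close()
```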

RESULTS

We implemented both our encoder and decoder in C++ and ran the experiments on a Linux workstation with a 2.4 GHz Intel Xeon CPU, 96 GB of memory, and an NVIDIA Tesla M2090 GPU. We tested the performance of the proposed HUGO codec with different setups on datasets taken from several types of NGS data, and compared the results with several existing compression algorithms, namely bzip2, CRAM/CRAM-l, and Samcomp. The latest CRAM V.2.0 and Samcomp V.0.7 were used in the comparison. The command line parameters used for bzip2, CRAM, CRAM-l, and Samcomp were, respectively, '-z', '--capture-all-quality-scores --capture-all-tags', '--capture-insertion-quality-scores --capture-piled-quality-scores --capture-substitution-quality-scores --capture-unmapped-quality-scores --capture-all-tags', and '' (ie, default parameters). The results of our evaluation are summarized in table 2, which includes the compressed size, CR, and compression/decompression time (CT/DT) for the different settings.

For all input sequences, bzip2 showed the worst compression performance, as it was designed for fast compression of general-purpose data; this failure of general-purpose compression algorithms is one of the crucial reasons why biomedical researchers pursue alternatives for efficient storage of large genome datasets. Owing to its per-position models, Samcomp seems to provide the best compression efficiency over all input sequences. However, it is important to note some limitations of Samcomp: it is not a full-field SAM/BAM compressor, because it primarily focuses on identifiers, sequences, and quality values, and the SAM header, the auxiliary fields, and the template fields in columns 7–9 are not preserved. For a fair comparison with Samcomp, we report the results of 'HUGO*' in table 2, in which the auxiliary fields and columns 7–9 are removed. Since the proposed HUGO method and the CRAM tool store only the number of each query name instead of the full name, the decompressed sequences of HUGO/CRAM act as the input sequences for Samcomp. Under these equivalent conditions, HUGO* achieves CRs similar to Samcomp, although its CPU costs (time and memory) are higher.

In this study, we focus on HUGO's adaptive length and highly tuned encoding and on its compression performance. Table 2 shows that our tool achieved lossless CRs between 0.5 and 0.65, which corresponds to space savings of 35–50%. For the lossy compression modes, the space savings increase to 50–90% by quantizing the quality values with k-means clustering. HUGO also saves 6–20% space compared to the CRAM tool for the same lossless or lossy compression mode. Although the proposed HUGO scheme achieves better compression efficiency, its compression time is 20–40% longer than CRAM's. Additionally, the maximum memory usage during compression roughly doubles over the experimental sets, as shown in table 3; the memory consumption of HUGO during compression is mainly due to the fast alignment tool SOAP3. In contrast, the memory usage for decompression in HUGO is much lower than for CRAM. In addition, the execution time can be considerably reduced when the SAM format is used as the input, since the conversion between SAM and BAM formats using SAMTOOLS introduces a large compression overhead.
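Throughout the results, the compression ratio (CR) is the compressed size divided by the input BAM size, and space savings are 1 − CR; for example, the 189 MB HUGO output for the 356 MB NA12878chrom20 BAM in table 2 gives CR ≈ 0.53, ie roughly 47% savings. A trivial helper makes this bookkeeping explicit.

```python
# Sketch: compression ratio and space savings as used in table 2 and the text.
def compression_stats(compressed_mb, original_mb):
    cr = compressed_mb / original_mb
    return {"CR": round(cr, 2), "savings_percent": round(100 * (1 - cr), 1)}

print(compression_stats(189, 356))   # {'CR': 0.53, 'savings_percent': 46.9}
```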

LIMITATIONS AND DISCUSSION

Even though EMRs can be stored efficiently, IMRs and UMRs, which account for 20–50% of the total reads in a sequence, are not as well compressed by traditional methods and may dominate the final storage size. The key idea of our proposed method is to use a tunable multi-reference based scheme for different resolutions of reads. However, in several instances there was no

Table 1 Compression results for the corresponding 10 fields in NA12878chrom20 and HG00096chrom11 sequences

| Field | Compression scheme | NA12878chrom20 original size (MB) | NA12878chrom20 compressed size | HG00096chrom11 original size (MB) | HG00096chrom11 compressed size |
|---|---|---|---|---|---|
| QNAME | HC | 87.1 | 2.2 | 113.0 | 2.2 |
| FLAG | HC | 16.6 | 1.9 | 21.5 | 2.0 |
| RNAME | RLE | 13.7 | 0.1 | 18.3 | 0.1 |
| POS | DC/HC | 40.4 | 3.2 | 56.2 | 5.5 |
| MAPQ | HC | 13.6 | 0.9 | 18.2 | 1.2 |
| CIGAR | LZW | 22.8 | 2.7 | 33.0 | 2.8 |
| MRNM | RLE | 9.2 | 0.1 | 12.2 | 0.2 |
| MPOS | DC/HC | 40.3 | 5.9 | 56.0 | 7.7 |
| TLEN | HC | 19.7 | 1.7 | 27.3 | 2.0 |
| OPT | bzip2 | 696.0 | 11.9 | 1127.0 | 21.9 |
| Total | | 659.4 | 30.6 | 1482.7 | 45.6 |

DC, Delta coding; HC, Huffman coding; LZW, Lempel-Ziv-Welch coding; RLE, run-length encoding.


Table 2 Comparisons of compression efficiency among the proposed HUGO/HUGO_L (lossy compression using k clusters), bzip2, CRAM, and Samcomp for a broad range of sequence types

| Input sequence | Program name | Reference | BAM size | CS | CR | CT | DT |
|---|---|---|---|---|---|---|---|
| NA12878 chrom20 | bzip2 | hg19 | 356 MB | 358 MB | – | 2 min 5 s | 1 min 32 s |
| | Samcomp | hg19 | | 163 MB | 0.46 | 50 s | 49 s |
| | HUGO* | hg19 | | 169 MB | 0.47 | 2 min 38 s | 2 min 00 s |
| | CRAM | hg19 | | 217 MB | 0.61 | 2 min 30 s | 2 min 25 s |
| | HUGO | hg19 | | 189 MB | 0.53 | 3 min 50 s | 2 min 55 s |
| | HUGO | hg19, HuRef | | 189 MB | 0.53 | 3 min 49 s | 2 min 57 s |
| | HUGO_L(30) | hg19 | | 187 MB | 0.53 | 4 min 00 s | 2 min 58 s |
| | HUGO_L(20) | hg19 | | 143 MB | 0.40 | 3 min 55 s | 2 min 55 s |
| | HUGO_L(10) | hg19 | | 98 MB | 0.28 | 3 min 50 s | 2 min 39 s |
| | HUGO_L(02) | hg19 | | 56 MB | 0.16 | 3 min 41 s | 1 min 40 s |
| | HUGO_L(01) | hg19 | | 53 MB | 0.15 | 3 min 40 s | 1 min 40 s |
| | CRAM-l | hg19 | | 54 MB | 0.15 | 2 min 20 s | 2 min 20 s |
| HG00096 chrom11 | bzip2 | hg19 | 661 MB | 663 MB | – | 3 min 15 s | 1 min 30 s |
| | Samcomp | hg19 | | 304 MB | 0.46 | 1 min 30 s | 1 min 44 s |
| | HUGO* | hg19, HuRef | | 315 MB | 0.48 | 3 min 20 s | 2 min 38 s |
| | CRAM | hg19 | | 411 MB | 0.62 | 4 min 00 s | 4 min 30 s |
| | HUGO | hg19 | | 348 MB | 0.53 | 5 min 01 s | 4 min 20 s |
| | HUGO | hg19, HuRef | | 346 MB | 0.52 | 5 min 00 s | 4 min 15 s |
| | HUGO_L(30) | hg19 | | 343 MB | 0.52 | 5 min 10 s | 4 min 14 s |
| | HUGO_L(20) | hg19 | | 271 MB | 0.41 | 5 min 05 s | 4 min 09 s |
| | HUGO_L(10) | hg19 | | 151 MB | 0.23 | 4 min 58 s | 4 min 02 s |
| | HUGO_L(02) | hg19 | | 75 MB | 0.11 | 4 min 55 s | 2 min 38 s |
| | HUGO_L(01) | hg19, HuRef | | 72 MB | 0.11 | 4 min 56 s | 2 min 30 s |
| | CRAM-l | hg19 | | 71 MB | 0.11 | 3 min 55 s | 4 min 31 s |
| HG00103 chrom11 | bzip2 | hg19 | 717 MB | 720 MB | – | 3 min 34 s | 1 min 30 s |
| | Samcomp | hg19 | | 333 MB | 0.46 | 1 min 36 s | 1 min 55 s |
| | HUGO* | hg19, HuRef | | 344 MB | 0.48 | 3 min 50 s | 2 min 49 s |
| | CRAM | hg19 | | 454 MB | 0.63 | 5 min 30 s | 5 min 15 s |
| | HUGO | hg19 | | 382 MB | 0.53 | 5 min 33 s | 4 min 40 s |
| | HUGO | hg19, HuRef | | 380 MB | 0.53 | 5 min 32 s | 4 min 38 s |
| HG01028 chrom11 | bzip2 | hg19 | 964 MB | 967 MB | – | 4 min 45 s | 2 min 10 s |
| | Samcomp | hg19 | | 458 MB | 0.48 | 2 min 06 s | 2 min 39 s |
| | HUGO* | hg19, HuRef | | 476 MB | 0.49 | 4 min 15 s | 2 min 58 s |
| | CRAM | hg19 | | 585 MB | 0.61 | 5 min 16 s | 7 min 00 s |
| | HUGO | hg19 | | 532 MB | 0.55 | 7 min 38 s | 5 min 20 s |
| | HUGO | hg19, HuRef | | 529 MB | 0.55 | 7 min 38 s | 5 min 22 s |
| NA06984 chrom11 | bzip2 | hg19 | 1.19 GB | 1.192 GB | – | 6 min 08 s | 2 min 32 s |
| | Samcomp | hg19 | | 504 MB | 0.41 | 2 min 32 s | 3 min 04 s |
| | HUGO* | hg19, HuRef | | 542 MB | 0.44 | 4 min 0 s | 3 min 42 s |
| | CRAM | hg19 | | 737 MB | 0.60 | 6 min 43 s | 9 min 00 s |
| | HUGO | hg19 | | 634 MB | 0.52 | 9 min 25 s | 8 min 55 s |
| | HUGO | hg19, HuRef | | 630 MB | 0.52 | 9 min 20 s | 8 min 50 s |
| NA06985 chrom11 | bzip2 | hg19 | 2.33 GB | 2.34 GB | – | 11 min 35 s | 5 min 10 s |
| | Samcomp | hg19 | | 1188 MB | 0.50 | 5 min 42 s | 6 min 21 s |
| | HUGO* | hg19, HuRef | | 1295 MB | 0.54 | 8 min 34 s | 7 min 14 s |
| | CRAM | hg19 | | 1570 MB | 0.66 | 12 min 40 s | 15 min 20 s |
| | HUGO | hg19 | | 1456 MB | 0.61 | 17 min 18 s | 12 min 43 s |
| | HUGO | hg19, HuRef | | 1451 MB | 0.61 | 17 min 15 s | 12 min 40 s |
| HG00096 mapped | Samcomp | hg19 | 14.53 GB | 6936 MB | 0.47 | 33 min 40 s | 38 min 20 s |
| | HUGO* | hg19, HuRef | | 7165 MB | 0.48 | 1 h 20 min | 1 h 20 min |
| | CRAM | hg19 | | n/a | n/a | n/a | n/a |
| | HUGO | hg19 | | 7965 MB | 0.54 | 2 h 30 min | 2 h 20 min |
| | HUGO | hg19, HuRef | | 7919 MB | 0.53 | 2 h 30 min | 2 h 20 min |
| RNA-Seq ERR019653 | Samcomp | NC_000913 | 169 MB | 79 MB | 0.47 | 30 s | 52 s |
| | HUGO* | NC_000913 | | 96 MB | 0.57 | 1 min 8 s | 1 min 45 s |
| | CRAM | NC_000913 | | 102 MB | 0.60 | 1 min 30 s | 1 min 32 s |
| | HUGO | NC_000913 | | 101 MB | 0.60 | 2 min 35 s | 2 min 30 s |
| ChIP-Seq SRR032209 | Samcomp | mouse | 608 MB | 270 MB | 0.44 | 1 min 25 s | 1 min 30 s |
| | HUGO* | mouse | | 275 MB | 0.45 | 3 min 01 s | 3 min 15 s |
| | CRAM | mouse | | 289 MB | 0.48 | 4 min 40 s | 4 min 17 s |
| | HUGO | mouse | | 299 MB | 0.49 | 4 min 42 s | 6 min 10 s |
| Cancer HCC1954 | Samcomp | hg19 | 29.7 GB | 8041 MB | 0.27 | 50 min 27 s | 53 min 38 s |
| | HUGO* | hg19 | | 8313 MB | 0.27 | 2 h 5 min | 2 h 2 min |
| | CRAM | hg19 | | 19 586 MB | 0.64 | 2 h 1 min | 3 h 8 min |
| | HUGO | hg19 | | 16 759 MB | 0.55 | 4 h 10 min | 4 h 8 min |

Here, 'n/a' indicates that an Exception error occurred because some reference sequences were not found in the fasta file when compressing the input file 'HG00096mapped' using CRAM. CR, compression ratio; CS, compressed size; CT, compression time; DT, decompression time; HUGO, Hierarchical mUlti-reference Genome cOmpression.


Table 3 Comparison of maximum memory usage among the proposed HUGO method, CRAM, and Samcomp, for different types of sequences

| Input sequence | Program name | BAM size | SAM size | Compression memory | Decompression memory |
|---|---|---|---|---|---|
| NA12878chrom20 | CRAM | 356 MB | 1.58 GB | 6.7 GB | 6.5 GB |
| | Samcomp | | | 292 MB | 320 MB |
| | HUGO | | | 14 GB | 1.9 GB |
| HG00096chrom11 | CRAM | 661 MB | 2.65 GB | 8.1 GB | 4.3 GB |
| | Samcomp | | | 300 MB | 320 MB |
| | HUGO | | | 15 GB | 2.4 GB |
| HG00103chrom11 | CRAM | 717 MB | 2.91 GB | 8.0 GB | 4.6 GB |
| | Samcomp | | | 301 MB | 320 MB |
| | HUGO | | | 16 GB | 2.6 GB |
| HG01028chrom11 | CRAM | 964 MB | 3.95 GB | 8 GB | 3.7 GB |
| | Samcomp | | | 301 MB | 320 MB |
| | HUGO | | | 16 GB | 3.2 GB |
| NA06984chrom11 | CRAM | 1.19 GB | 5.16 GB | 8.2 GB | 4.1 GB |
| | Samcomp | | | 300 MB | 320 MB |
| | HUGO | | | 19 GB | 3.5 GB |
| NA06985chrom11 | CRAM | 2.33 GB | 9.41 GB | 8.1 GB | 4.3 GB |
| | Samcomp | | | 300 MB | 320 MB |
| | HUGO | | | 19 GB | 4.0 GB |
| HG00096mapped | CRAM | 14.53 GB | 60.1 GB | n/a | n/a |
| | Samcomp | | | 314 MB | 320 MB |
| | HUGO | | | 45 GB | 34 GB |
| RNA-Seq ERR019653 | CRAM | 169 MB | 1.32 GB | 6.2 GB | 3.3 GB |
| | Samcomp | | | 285 MB | 290 MB |
| | HUGO | | | 3.2 GB | 1.7 GB |
| ChIP-Seq SRR032209 | CRAM | 608 MB | 2.36 GB | 8.1 GB | 4.3 GB |
| | Samcomp | | | 300 MB | 318 MB |
| | HUGO | | | 14.8 GB | 2.3 GB |
| Cancer HCC1954 | CRAM | 29.7 GB | 92.8 GB | 9.6 GB | 9.1 GB |
| | Samcomp | | | 320 MB | 330 MB |
| | HUGO | | | 46 GB | 37 GB |

HUGO, Hierarchical mUlti-reference Genome cOmpression; SAM, Sequence Alignment/Map.

obvious improvement (ie, a CR decrease of at least 0.05) in the rate of EMRs compared to directly aligning the shortened reads against the hg19 reference alone. The reason is that almost all target sequences are close to the reference sequence hg19, and we do not have other reference sequences that represent these sequences any better. As shown in figure 3, the adoption of a multi-reference based structure even generates slightly fewer exact mapped reads than the single-reference based framework for the input sequence NA12878chrom20. Our method is limited by the availability of 'useful' reference genomes, that is, references with identical or near-identical structural characteristics to those of the sequence being compressed. For this reason, the order of the reference genomes used in the hierarchical alignment is also important. The best yield is achieved when the reference most similar to the source reads is chosen, even though part of these differences stems from sequencing errors rather than genetic variation. This could be done, for example, by matching phenotype characteristics such as ethnicity or disease, or by leveraging additional information such as family data. For example, Illumina recently started to provide whole genome sequence and variant call data for 17 members of the Coriell CEPH/UTAH 1463 family (http://www.illumina.com/platinumgenomes/) in order to create a 'platinum' standard comprehensive set of variant calls. We expect that our multi-reference method can be useful for compressing 'pedigree' genomes that are becoming increasingly available. For genomes with a high number of somatic mutations it is still difficult to find an appropriate reference, but as more data become available the similarity data could be recorded for purposes other than compression.

On the other hand, the compression time, the required memory, and the success of our proposed method depend strongly on the SOAP3 alignment tool, which sets a high bar for the running environment: if the GPU-based alignment tool is used, users need a Linux workstation equipped with a multi-core CPU (quad-core by default), at least 20 GB of main memory, and a CUDA-enabled GPU with compute capability 2.0 and at least 3 GB of memory (6 GB by default). Fortunately, new alignment tools with increasing speed and sensitivity are being developed every year, which naturally boosts the efficiency of our proposed method.

CONCLUSION

Storage and transmission are important challenges in the use of large sequencing 'Big Data'. We developed a novel compression technique, the HUGO framework, for compressing aligned reads. Our method also presents an innovative way of hierarchically matching gradually shortened reads in order to make full use of the available reference genomes. Our experiments compared the performance of our algorithm with other state-of-the-art compression algorithms, such as CRAM, to which it was superior, and Samcomp, which had similar compression performance.

Contributors PL and XJ designed the study and were responsible for the experiments, SW assisted with GPU computing and manuscript preparation, JK assisted with the analyses, HX and LO-M wrote parts and edited all versions of the manuscript.

Funding This work was supported by NIH grant numbers K99LM011392 and U54HL108460, NLM grant number R00LM011392, and NSFC grant numbers U1201255, 61271218 and 61228101.


Competing interests None.

Provenance and peer review Not commissioned; externally peer reviewed.

Open Access This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/3.0/

REFERENCES

1 Ng SB, Turner EH, Robertson PD, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 2009;461:272–6.
2 Abecasis GR, Altshuler D, Auton A, et al. A map of human genome variation from population-scale sequencing. Nature 2010;467:1061–73.
3 The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 2012;491:56–65.
4 Sequence Read Archive. http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement/
5 Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol 1990;215:403–10.
6 Liu Y, Schmidt B, Maskell DL. CUDASW++ 2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Res Notes 2010;3:93.
7 Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10:R25.
8 Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci 1988;85:2444–8.
9 Langmead B. Aligning short sequencing reads with Bowtie. Curr Protoc Bioinformatics 2010;32:11.7.1–11.7.14.
10 Jorde LB, Wooding SP. Genetic variation, classification and 'race'. Nat Genet 2004;36:S28–33.
11 Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25:2078–9.
12 Bentley DR, Balasubramanian S, Swerdlow HP, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008;456:53–9.
13 Margulies M, Egholm M, Altman WE, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005;437:376–80.
14 Dorff KC, Chambwe N, Zeno Z, et al. GobyWeb: simplified management and analysis of gene expression and DNA methylation sequencing data. Quantitative Methods arXiv 2012;1211:6666.
15 DePristo MA, Banks E, Poplin R, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43:491–8.
16 McKenna A, Hanna M, Banks E, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20:1297–303.
17 Sakib MN, Tang J, Zheng WJ, et al. Improving transmission efficiency of large sequence alignment/map (SAM) files. PloS ONE 2011;6:e28251.
18 Ohno-Machado L, Bafna V, Boxwala AA, et al. iDASH: integrating data for analysis, anonymization, and sharing. J Am Med Inform Assoc 2012;19:196–201.
19 Hach F, Numanagić I, Alkan C, et al. SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 2012;28:3051–7.
20 Fritz MH-Y, Leinonen R, Cochrane G, et al. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res 2011;21:734–40.
21 Jones DC, Ruzzo WL, Peng X, et al. Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res 2012;40:e171.
22 Popitsch N, von Haeseler A. NGC: lossless and lossy compression of aligned high-throughput sequencing data. Nucleic Acids Res 2013;41:e27.
23 Bonfield JK, Mahoney MV. Compression of FASTQ and SAM format sequencing data. PloS ONE 2013;8:e59190.
24 Liu C-M, Wong T, Wu E, et al. SOAP3: ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics 2012;28:878–9.
25 Genome. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/pilot2_high_cov_GRCh37_bams/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam
26 Cleary J, Witten I. Data compression using adaptive coding and partial string matching. IEEE Trans Commun 1984;32:396–402.
27 Moffat A. Implementing the PPM data compression scheme. IEEE Trans Commun 1990;38:1917–21.
28 Kozanitis C, Saunders C, Kruglyak S, et al. Compressing genomic sequence fragments using SlimGene. J Comput Biol 2011;18:401–13.
29 Deorowicz S, Grabowski S. Compression of DNA sequence reads in FASTQ format. Bioinformatics 2011;27:860–2.
30 Xu H, Luo X, Qian J, et al. FastUniq: a fast de novo duplicates removal tool for paired short reads. PloS ONE 2012;7:e52249.
31 MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability; California, USA: 1967.

