Tiezheng Yuan Ph.D.: NGS: Interpretations of SAM format

Abstract: Howe to analyze SAM files.

The definition of the SAM format can be downloaded from http://samtools.github.io/hts-specs/SAMv1.pdf. Be default, a SAM file begins with SAM header, and then the records of alignment with at least 12 fields separated by tabs.

SAM header

Be default, the SAM header begins with @HD, @SQ, @RG and @PG lines. @HD line would be the first line if present. @SQ are the dictionary of reference sequences.

Dictionary of alignment record

Here is an example of alignment record of paired-end alignment:

R0230412_0118:1:1103:9799:151900#ATGTCA 177 1 10541 0 50M 3 198180787 0 CCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCG ccccccccccccccdddccceeegggihhhihiiiiigggggeeeeebbb AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19C30 YT:Z:UU NH:i:20 CC:Z:12 CP:i:10658 HI:i:5

1: Name of the read. Here, that is consistent with the name (the first line) from FASTQ file

2: Sum of all applicable flags. The details would be described below.

3: Name of reference if alignment occurs, which is consistent with the name of references in FASTA format file. Or would be ‘*’ if no alignment occurs.

4: The most-left-side offset at the forward strand of reference.

5: Mapping quality.

6: CIGAR string. It can be used for mutation analysis.

7: Name of reference for the alignment of the other read.

8: The most-left-side offset for the alignment of the other read.

9: Inferred fragment.

10: Read sequences. It would be reverse complemented if it is aligned to the reverse strand of reference.

11:ASCII-encoded read qualities.

12: Optional fields separated by tabs.

The flags of paired-end alignment

The sum of flags for paired-end alignment is always the odd.

Decimal	Binary	Hexadecimal	Description
1	000000000001	0x1	This read is one of paired reads
2	000000000010	0x2	This read is properly aligned to the forward strand of reference for paired-end alignment. The "properly aligned" means that they mapped within a proper distance from each other (concordant pairs), the parameters of which are designed by -I/-x in Bowtie2.
4	000000000100	0x4	This read is not aligned.
8	000000001000	0x8	This read is no aligned, and is one of paired reads.
16	000000010000	0x10	This read is aligned to the reverse-complementary strand of reference.
32	000000100000	0x20	The other read of paired reads is aligned to the reverse strand of reference.
64	000001000000	0x40	This read is mate 1 in paired reads (R1).
128	000010000000	0x80	This read is mate 2 in paired reads (R2).
256	000100000000	0x100	Secondary alignment for multiple alignments
512	001000000000	0x200	Not passing quality controls
1024	010000000000	0x400	PCR or optical duplicate
2048	100000000000	0x800	Supplementary alignment

The next issue is how to judge the alignment. The number of reads with flags should be matched to each other. The lower number would be from the mate 1 (R1), and the higher would be from the mate 2 (R2). Something must be wrong with your alignment if those paired numbers don't match. Here are some examples:

Both alignment: Good alignment reveals that both reads in the paired reads are successfully aligned to the reference. The difference between the two flags would be 64 mostly, or 80, or 48.

65 - 000001000001 – This read is the first read (R1) of a pair. both reads are aligned to the forward strand.

129 - 000010000001 - This read is the second read (R2) of a pair. Both reads are aligned to the forward strand.

67 - 000001000001 – This read is the first read (R1) in a concordant pair. Both reads are properly aligned to the forward strand.

131 - 000010000001 - This read is the second read (R2) in a concordant pair. Both reads are properly aligned to the forward strand.

113 - 000001110001 - This is the first read (R1) of a pair. Both reads in pair were aligned to reverse complementary strands of reference.

177 - 000010110001 - This is the second read (R2) of a pair. Both reads in pair were aligned to reverse complementary strands of reference.

81 - 000001010001 - This is the first read (R1), and aligned to the reverse complementary strand of reference. Both reads are aligned.

161 - 000010100001 - This is the second read (R2), and the other read (R1) aligned to the reverse complementary strand of reference. Both reads are aligned.

83 - 000001010011 - This is the first read (R1), and aligned to the reverse complementary strand of reference. Both reads are properly aligned.

163 - 000010100011 - This is the second read (R2), and the other read (R1) aligned to the reverse complementary strand of reference. Both reads are properly aligned.

97 - 000001100001 - This is first read, its mate is flipped but this is forward. Both reads are aligned.

145 - 000010010001 - This is second read. it is flipped but its mate is not. Both reads are aligned.

90 - 000001100011 - This is first read, its mate is flipped but this is forward. Both reads are properly aligned.

147 - 000010010011 - This is second read. it is flipped but its mate is not. Both reads are properly aligned.

One alignment: That reveals only one read is aligned ,and the other one is unaligned.

69 - 000001000101 – The first read (R1) in pair. This read is unaligned, and the other read is unaligned.

137 - 000010001001 – the second (R2) in pair. This read is aligned, and the other read is unaligned.

73 - 000001001001 – The first read (R1) in pair. This read is aligned, and the other read is unaligned.

133 - 000010000101 - The second read (R2) in pair. This read is unaligned, and the other read is aligned.

No alignment: Both of the reads of a pair is unaligned.

77 - 000001001101 - The first read (R1) in pair. Both of the reads are unaligned.

141 - 000010001101 - The second read (R2) in pair. Both of the reads are unaligned.

Multiple alignment: It is also known as ambiguous alignment. The sum of flags is always more than 300 because the flag contains the bit of 0x100 (256). Here, I provide two examples.

First one is two paired-end alignments for a pair aligned to the identical regions with different chromosomal positions, the sum of the flags are 99 (64+32+2+1) vs. 147 (128+16+2+1) and 355 (256+64+32+2+1) vs. 403 (256+128+16+2+1). The first alignment might be real, and the second one may not be real because the reads of this pair are mapped into different chromosome.

R0230412_0118:1:1102:18601:110990#ATGTCA 99 1 147890 3 18M2I30M = 147961 121 GAGCAAGCCCCTATCTAGAAAAAAAAAAAATGTCAGTGAAGATGTGGAGG bbbeeeeeggggghiiiiiiiihiiiiiid]\V\Zd^^^^\^ZaM^_Z`` AS:i:-11 XN:i:0 XM:i:0 XO:i:1 XG:i:2 NM:i:2 MD:Z:48 YT:Z:UU XS:A:- NH:i:2 CC:Z:= CP:i:748199 HI:i:0

R0230412_0118:1:1102:18601:110990#ATGTCA 147 1 147961 3 50M = 147890 -121 GGTGGGAACATAAAATTGTGTAACCATTTTGTTTGGGTATTTCTTTTCTT iiiiihihiiiihihiiihhghgdgeghiihiiiiiigggggeeeeebbb AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:- NH:i:2 CC:Z:= CP:i:748270 HI:i:0

R0230412_0118:1:1102:18601:110990#ATGTCA 355 1 748199 3 18M2I30M = 748270 121 GAGCAAGCCCCTATCTAGAAAAAAAAAAAATGTCAGTGAAGATGTGGAGG bbbeeeeeggggghiiiiiiiihiiiiiid]\V\Zd^^^^\^ZaM^_Z`` AS:i:-11 XN:i:0 XM:i:0 XO:i:1 XG:i:2 NM:i:2 MD:Z:48 YT:Z:UU XS:A:- NH:i:2 HI:i:1

R0230412_0118:1:1102:18601:110990#ATGTCA 403 1 748270 3 50M = 748199 -121 GGTGGGAACATAAAATTGTGTAACCATTTTGTTTGGGTATTTCTTTTCTT iiiiihihiiiihihiiihhghgdgeghiihiiiiiigggggeeeeebbb AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:- NH:i:2 HI:i:1

However, multiple alignment sometimes make sense. The second example of multiple alignment comes from RNA splicing. The sum of the flags are 83 vs. 163, 339 vs. 419, and 353 vs. 401.

#splicing 1

R0230412_0118:1:2307:9799:151900#ATGTCA 83 17 7626607 1 8M138N41M149N1M = 7626471 -473 AAGGCCACCTCAGCCACCTTTTTGATTATTTTGGAACCAATCCCTTGCCC hhffdeahgdhhhhfgihfiiihhhhfiiiihfffghfggegeeca\Pa_ AS:i:-2 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:47A2 YT:Z:UU XS:A:- NH:i:3 CC:Z:= CP:i:7626607 HI:i:0

R0230412_0118:1:2307:9799:151900#ATGTCA 163 17 7626471 1 50M = 7626607 473 TGCCTCTCCTTGAAAGCAGAAGAAGTGCCAGCCCTCAGCTTCCGTCAGAT _abeeceegfcg^``egadbggdbdbefggfdfhhfhhddgdhhgeff^c AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:- NH:i:3 CC:Z:= CP:i:7626471 HI:i:0

#splicing 2

R0230412_0118:1:2307:9799:151900#ATGTCA 339 17 7626607 1 8M138N41M349N1M = 7626471 -673 AAGGCCACCTCAGCCACCTTTTTGATTATTTTGGAACCAATCCCTTGCCC hhffdeahgdhhhhfgihfiiihhhhfiiiihfffghfggegeeca\Pa_ AS:i:-2 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:47A2 YT:Z:UU XS:A:- NH:i:3 CC:Z:= CP:i:7626607 HI:i:1

R0230412_0118:1:2307:9799:151900#ATGTCA 419 17 7626471 1 50M = 7626607 673 TGCCTCTCCTTGAAAGCAGAAGAAGTGCCAGCCCTCAGCTTCCGTCAGAT _abeeceegfcg^``egadbggdbdbefggfdfhhfhhddgdhhgeff^c AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:- NH:i:3 CC:Z:= CP:i:7626471 HI:i:1

#splicing 3

R0230412_0118:1:2307:9799:151900#ATGTCA 339 17 7626607 1 8M138N41M569N1M = 7626471 -893 AAGGCCACCTCAGCCACCTTTTTGATTATTTTGGAACCAATCCCTTGCCC hhffdeahgdhhhhfgihfiiihhhhfiiiihfffghfggegeeca\Pa_ AS:i:-2 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:47A2 YT:Z:UU XS:A:- NH:i:3 HI:i:2

R0230412_0118:1:2307:9799:151900#ATGTCA 419 17 7626471 1 50M = 7626607 893 TGCCTCTCCTTGAAAGCAGAAGAAGTGCCAGCCCTCAGCTTCCGTCAGAT _abeeceegfcg^``egadbggdbdbefggfdfhhfhhddgdhhgeff^c AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:- NH:i:3 HI:i:2

the flags of single-end alignment

The sum of flags for single-end alignment is always the even.

Decimal	Binary	Hexadecimal	Description
0	000000000000	0x0	This read is aligned to the forward strand of reference
4	000000000100	0x4	This read is not aligned.
16	000000010000	0x10	This read is aligned to the reverse-complementary strand of reference.

The other fields

AS:i:<N>:Alignment score generated by aligner. The higher <N> would be better. It is usually more than 0, but It can be negative. It is only present for an aligned read.

NM:i:<N>, XM:i:<N>, and MD:Z:<N>: the edit distance, the number and positions of mismatches. Both of the fields are often combined with CIGAR string for analyzing mutations including SNPs/insert/delete. For example:The read length is 50 bases.

All matches	CIGAR=50M, XM:i:0 NM:i:0 MD:Z:50	All bases of the read matched to the reference.
One SNP	CIGAR=50M XM:i:1 NM:i:1 MD:Z:19C30	One SNP site (The base is C at the reference) is at the 20^th base of the read.
across an intron	CIGAR=21M659N29M XM:i:0 NM:i:0 MD:Z:50	All bases matched to the reference. But the read cross an intron on the DNA reference. The regions with 21 and 29 bases aligned to two exon, respectively. The intron length is 659 bases.
One insertion, and one SNP	CIGAR=45M1I4M XM:i:1 NM:i:2 MD:Z:47G1	There are two mismatched bases in the read. Of the read, the first 45 bases aligned to the reference, and the 46^th base is insertion, and the 47-50th bases aligned to the reference. One SNP (G in reference) at the 49^th base in the read is detected.
One insertion	CIGAR=12M2I36M XM:i:0 NM:i:2 MD:Z:48	There are two mismatched bases in the read. Of the read, the first 12 bases aligned to the reference, and the 13-14^th bases are insertion, and the last 36 bases aligned to the reference.
One deletion	CIGAR=15M2D35M XM:i:0 NM:i:2 MD:Z:15^AG35	There are two mismatched bases in the read. Of the read, the first 15 bases aligned to the reference, and the two bases after 15 th are deletion, and the last 35 bases aligned to the reference.
One deletion, and one SNP	CIGAR=13M1D37M XM:i:1 NM:i:2 MD:Z:0A12^A37	There are two mismatched bases in the read. Of the read, the first base is SNP (A in reference), and the 2-13th bases are aligned, and one base after the 13^th base is deletion, and the last 37 bases aligned to reference.

CC:i:<N> and CP:i:<N>: Chromosome name and offset of next alignment, '=' if on the same chr.

HI:i:<N>: hit index (increasing from from 0 to NH-1). The best hit does not have to be the first one.

Writing date: 2014.09.23, 2015.02.04

Tiezheng Yuan Ph.D.

Wednesday, February 4, 2015

NGS: Interpretations of SAM format

No comments:

Post a Comment