Wednesday, February 4, 2015

NGS: Interpretations of SAM format


Abstract: Howe to analyze SAM files.


The definition of the SAM format can be downloaded from http://samtools.github.io/hts-specs/SAMv1.pdf. Be default, a SAM file begins with SAM header, and then the records of alignment with at least 12 fields separated by tabs.

SAM header
Be default, the SAM header begins with @HD, @SQ, @RG and @PG lines. @HD line would be the first line if present. @SQ are the dictionary of reference sequences.

Dictionary of alignment record

Here is an example of alignment record of paired-end alignment:
R0230412_0118:1:1103:9799:151900#ATGTCA 177 1 10541 0 50M 3 198180787 0 CCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCG ccccccccccccccdddccceeegggihhhihiiiiigggggeeeeebbb AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19C30 YT:Z:UU NH:i:20 CC:Z:12 CP:i:10658 HI:i:5
1: Name of the read. Here, that is consistent with the name (the first line) from FASTQ file
2: Sum of all applicable flags. The details would be described below.
3: Name of reference if alignment occurs, which is consistent with the name of references in FASTA format file. Or would be ‘*’ if no alignment occurs.
4: The most-left-side offset at the forward strand of reference.
5: Mapping quality.
6: CIGAR string. It can be used for mutation analysis.
7: Name of reference for the alignment of the other read.
8: The most-left-side offset for the alignment of the other read.
9: Inferred fragment.
10: Read sequences. It would be reverse complemented if it is aligned to the reverse strand of reference.
11:ASCII-encoded read qualities.
12: Optional fields separated by tabs.

The flags of paired-end alignment
The sum of flags for paired-end alignment is always the odd.
Decimal
Binary
Hexadecimal
Description
1
000000000001
0x1
This read is one of paired reads
2
000000000010
0x2
This read is properly aligned to the forward strand of reference for paired-end alignment. The "properly aligned" means that they mapped within a proper distance from each other (concordant pairs), the parameters of which are designed by -I/-x in Bowtie2.
4
000000000100
0x4
This read is not aligned.
8
000000001000
0x8
This read is no aligned, and is one of paired reads.
16
000000010000
0x10
This read is aligned to the reverse-complementary strand of reference.
32
000000100000
0x20
The other read of paired reads is aligned to the reverse strand of reference.
64
000001000000
0x40
This read is mate 1 in paired reads (R1).
128
000010000000
0x80
This read is mate 2 in paired reads (R2).
256
000100000000
0x100
Secondary alignment for multiple alignments
512
001000000000
0x200
Not passing quality controls
1024
010000000000
0x400
PCR or optical duplicate
2048
100000000000
0x800
Supplementary alignment

The next issue is how to judge the alignment. The number of reads with flags should be matched to each other. The lower number would be from the mate 1 (R1), and the higher would be from the mate 2 (R2). Something must be wrong with your alignment if those paired numbers don't match. Here are some examples:

Both alignment: Good alignment reveals that both reads in the paired reads are successfully aligned to the reference. The difference between the two flags would be 64 mostly, or 80, or 48.
65 - 000001000001 – This read is the first read (R1) of a pair. both reads are aligned to the forward strand.
129 - 000010000001 - This read is the second read (R2) of a pair. Both reads are aligned to the forward strand.

67 - 000001000001 – This read is the first read (R1) in a concordant pair. Both reads are properly aligned to the forward strand.
131 - 000010000001 - This read is the second read (R2) in a concordant pair. Both reads are properly aligned to the forward strand.

113 - 000001110001 - This is the first read (R1) of a pair. Both reads in pair were aligned to reverse complementary strands of reference.
177 - 000010110001 - This is the second read (R2) of a pair. Both reads in pair were aligned to reverse complementary strands of reference.

81 - 000001010001 - This is the first read (R1), and aligned to the reverse complementary strand of reference. Both reads are aligned.
161 - 000010100001 - This is the second read (R2), and the other read (R1) aligned to the reverse complementary strand of reference. Both reads are aligned.

83 - 000001010011 - This is the first read (R1), and aligned to the reverse complementary strand of reference. Both reads are properly aligned.
163 - 000010100011 - This is the second read (R2), and the other read (R1) aligned to the reverse complementary strand of reference. Both reads are properly aligned.

97 - 000001100001 - This is first read, its mate is flipped but this is forward. Both reads are aligned.
145 - 000010010001 - This is second read. it is flipped but its mate is not. Both reads are aligned.

90 - 000001100011 - This is first read, its mate is flipped but this is forward. Both reads are properly aligned.
147 - 000010010011 - This is second read. it is flipped but its mate is not. Both reads are properly aligned.

One alignment: That reveals only one read is aligned ,and the other one is unaligned.
69 - 000001000101 – The first read (R1) in pair. This read is unaligned, and the other read is unaligned.
137 - 000010001001 – the second (R2) in pair. This read is aligned, and the other read is unaligned.

73 - 000001001001 – The first read (R1) in pair. This read is aligned, and the other read is unaligned.
133 - 000010000101 - The second read (R2) in pair. This read is unaligned, and the other read is aligned.

No alignment: Both of the reads of a pair is unaligned.
77 - 000001001101 - The first read (R1) in pair. Both of the reads are unaligned.
141 - 000010001101 - The second read (R2) in pair. Both of the reads are unaligned.

Multiple alignment: It is also known as ambiguous alignment. The sum of flags is always more than 300 because the flag contains the bit of 0x100 (256). Here, I provide two examples.

First one is two paired-end alignments for a pair aligned to the identical regions with different chromosomal positions, the sum of the flags are 99 (64+32+2+1) vs. 147 (128+16+2+1) and 355 (256+64+32+2+1) vs. 403 (256+128+16+2+1). The first alignment might be real, and the second one may not be real because the reads of this pair are mapped into different chromosome.
R0230412_0118:1:1102:18601:110990#ATGTCA 99 1 147890 3 18M2I30M = 147961 121 GAGCAAGCCCCTATCTAGAAAAAAAAAAAATGTCAGTGAAGATGTGGAGG bbbeeeeeggggghiiiiiiiihiiiiiid]\V\Zd^^^^\^ZaM^_Z`` AS:i:-11 XN:i:0 XM:i:0 XO:i:1 XG:i:2 NM:i:2 MD:Z:48 YT:Z:UU XS:A:- NH:i:2 CC:Z:= CP:i:748199 HI:i:0
R0230412_0118:1:1102:18601:110990#ATGTCA 147 1 147961 3 50M = 147890 -121 GGTGGGAACATAAAATTGTGTAACCATTTTGTTTGGGTATTTCTTTTCTT iiiiihihiiiihihiiihhghgdgeghiihiiiiiigggggeeeeebbb AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:- NH:i:2 CC:Z:= CP:i:748270 HI:i:0

R0230412_0118:1:1102:18601:110990#ATGTCA 355 1 748199 3 18M2I30M = 748270 121 GAGCAAGCCCCTATCTAGAAAAAAAAAAAATGTCAGTGAAGATGTGGAGG bbbeeeeeggggghiiiiiiiihiiiiiid]\V\Zd^^^^\^ZaM^_Z`` AS:i:-11 XN:i:0 XM:i:0 XO:i:1 XG:i:2 NM:i:2 MD:Z:48 YT:Z:UU XS:A:- NH:i:2 HI:i:1
R0230412_0118:1:1102:18601:110990#ATGTCA 403 1 748270 3 50M = 748199 -121 GGTGGGAACATAAAATTGTGTAACCATTTTGTTTGGGTATTTCTTTTCTT iiiiihihiiiihihiiihhghgdgeghiihiiiiiigggggeeeeebbb AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:- NH:i:2 HI:i:1

However, multiple alignment sometimes make sense. The second example of multiple alignment comes from RNA splicing. The sum of the flags are 83 vs. 163, 339 vs. 419, and 353 vs. 401.
#splicing 1
R0230412_0118:1:2307:9799:151900#ATGTCA 83 17 7626607 1 8M138N41M149N1M = 7626471 -473 AAGGCCACCTCAGCCACCTTTTTGATTATTTTGGAACCAATCCCTTGCCC hhffdeahgdhhhhfgihfiiihhhhfiiiihfffghfggegeeca\Pa_ AS:i:-2 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:47A2 YT:Z:UU XS:A:- NH:i:3 CC:Z:= CP:i:7626607 HI:i:0
R0230412_0118:1:2307:9799:151900#ATGTCA 163 17 7626471 1 50M = 7626607 473 TGCCTCTCCTTGAAAGCAGAAGAAGTGCCAGCCCTCAGCTTCCGTCAGAT _abeeceegfcg^``egadbggdbdbefggfdfhhfhhddgdhhgeff^c AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:- NH:i:3 CC:Z:= CP:i:7626471 HI:i:0
#splicing 2
R0230412_0118:1:2307:9799:151900#ATGTCA 339 17 7626607 1 8M138N41M349N1M = 7626471 -673 AAGGCCACCTCAGCCACCTTTTTGATTATTTTGGAACCAATCCCTTGCCC hhffdeahgdhhhhfgihfiiihhhhfiiiihfffghfggegeeca\Pa_ AS:i:-2 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:47A2 YT:Z:UU XS:A:- NH:i:3 CC:Z:= CP:i:7626607 HI:i:1
R0230412_0118:1:2307:9799:151900#ATGTCA 419 17 7626471 1 50M = 7626607 673 TGCCTCTCCTTGAAAGCAGAAGAAGTGCCAGCCCTCAGCTTCCGTCAGAT _abeeceegfcg^``egadbggdbdbefggfdfhhfhhddgdhhgeff^c AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:- NH:i:3 CC:Z:= CP:i:7626471 HI:i:1
#splicing 3
R0230412_0118:1:2307:9799:151900#ATGTCA 339 17 7626607 1 8M138N41M569N1M = 7626471 -893 AAGGCCACCTCAGCCACCTTTTTGATTATTTTGGAACCAATCCCTTGCCC hhffdeahgdhhhhfgihfiiihhhhfiiiihfffghfggegeeca\Pa_ AS:i:-2 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:47A2 YT:Z:UU XS:A:- NH:i:3 HI:i:2
R0230412_0118:1:2307:9799:151900#ATGTCA 419 17 7626471 1 50M = 7626607 893 TGCCTCTCCTTGAAAGCAGAAGAAGTGCCAGCCCTCAGCTTCCGTCAGAT _abeeceegfcg^``egadbggdbdbefggfdfhhfhhddgdhhgeff^c AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:- NH:i:3 HI:i:2

the flags of single-end alignment
The sum of flags for single-end alignment is always the even.
Decimal
Binary
Hexadecimal
Description
0
000000000000
0x0
This read is aligned to the forward strand of reference
4
000000000100
0x4
This read is not aligned.
16
000000010000
0x10
This read is aligned to the reverse-complementary strand of reference.

The other fields
AS:i:<N>:Alignment score generated by aligner. The higher <N> would be better. It is usually more than 0, but It can be negative. It is only present for an aligned read.

NM:i:<N>, XM:i:<N>, and MD:Z:<N>: the edit distance, the number and positions of mismatches. Both of the fields are often combined with CIGAR string for analyzing mutations including SNPs/insert/delete. For example:The read length is 50 bases.
All matches
CIGAR=50M,
XM:i:0
NM:i:0
MD:Z:50
All bases of the read matched to the reference.
One SNP
CIGAR=50M
XM:i:1
NM:i:1
MD:Z:19C30
One SNP site (The base is C at the reference) is at the 20th base of the read.
across an intron
CIGAR=21M659N29M
XM:i:0
NM:i:0
MD:Z:50
All bases matched to the reference. But the read cross an intron on the DNA reference. The regions with 21 and 29 bases aligned to two exon, respectively. The intron length is 659 bases.
One insertion, and one SNP
CIGAR=45M1I4M
XM:i:1
NM:i:2
MD:Z:47G1
There are two mismatched bases in the read. Of the read, the first 45 bases aligned to the reference, and the 46th base is insertion, and the 47-50th bases aligned to the reference. One SNP (G in reference) at the 49th base in the read is detected.
One insertion
CIGAR=12M2I36M
XM:i:0
NM:i:2
MD:Z:48
There are two mismatched bases in the read. Of the read, the first 12 bases aligned to the reference, and the 13-14th bases are insertion, and the last 36 bases aligned to the reference.
One deletion
CIGAR=15M2D35M
XM:i:0
NM:i:2
MD:Z:15^AG35
There are two mismatched bases in the read. Of the read, the first 15 bases aligned to the reference, and the two bases after 15 th are deletion, and the last 35 bases aligned to the reference.
One deletion, and one SNP
CIGAR=13M1D37M
XM:i:1
NM:i:2
MD:Z:0A12^A37
There are two mismatched bases in the read. Of the read, the first base is SNP (A in reference), and the 2-13th bases are aligned, and one base after the 13th base is deletion, and the last 37 bases aligned to reference.

CC:i:<N> and CP:i:<N>: Chromosome name and offset of next alignment, '=' if on the same chr.

HI:i:<N>: hit index (increasing from from 0 to NH-1). The best hit does not have to be the first one.



Writing date: 2014.09.23, 2015.02.04

No comments:

Post a Comment