Abstract:
Howe to analyze SAM files.
The definition of
the SAM format can be downloaded from
http://samtools.github.io/hts-specs/SAMv1.pdf. Be default, a SAM file
begins with SAM header, and then the records of alignment with at
least 12 fields separated by tabs.
SAM header
Be default, the SAM
header begins with @HD, @SQ, @RG and @PG lines. @HD line would be the
first line if present. @SQ are the dictionary of reference sequences.
Dictionary of
alignment record
Here is an example
of alignment record of paired-end alignment:
R0230412_0118:1:1103:9799:151900#ATGTCA 177 1 10541 0
50M 3 198180787 0
CCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCG
ccccccccccccccdddccceeegggihhhihiiiiigggggeeeeebbb AS:i:-6
XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19C30 YT:Z:UU
NH:i:20 CC:Z:12 CP:i:10658 HI:i:5
1: Name of the
read. Here, that is consistent with the name (the first line) from
FASTQ file
2: Sum of all
applicable flags. The details would be described below.
3: Name of
reference if alignment occurs, which is consistent with the name of
references in FASTA format file. Or would be ‘*’ if no alignment
occurs.
4: The
most-left-side offset at the forward strand of reference.
5: Mapping quality.
6: CIGAR string. It
can be used for mutation analysis.
7: Name of
reference for the alignment of the other read.
8: The
most-left-side offset for the alignment of the other read.
9: Inferred
fragment.
10: Read sequences.
It would be reverse complemented if it is aligned to the reverse
strand of reference.
11:ASCII-encoded
read qualities.
12: Optional fields
separated by tabs.
The flags of
paired-end alignment
The sum of flags for paired-end alignment is always the odd.
Decimal |
Binary |
Hexadecimal |
Description |
1 |
000000000001 |
0x1 |
This read is one of paired reads |
2 |
000000000010 |
0x2 |
This read is properly aligned to the forward strand of
reference for paired-end alignment. The "properly aligned"
means that they mapped within a proper distance from each other
(concordant pairs), the parameters of which are designed by -I/-x
in Bowtie2. |
4 |
000000000100 |
0x4 |
This read is not aligned. |
8 |
000000001000 |
0x8 |
This read is no aligned, and is one of paired reads. |
16 |
000000010000 |
0x10 |
This read is aligned to the reverse-complementary strand of
reference. |
32 |
000000100000 |
0x20 |
The other read of paired reads is aligned to the reverse strand
of reference. |
64 |
000001000000 |
0x40 |
This read is mate 1 in paired reads (R1). |
128 |
000010000000 |
0x80 |
This read is mate 2 in paired reads (R2). |
256 |
000100000000 |
0x100 |
Secondary alignment for multiple alignments |
512 |
001000000000 |
0x200 |
Not passing quality controls |
1024 |
010000000000 |
0x400 |
PCR or optical duplicate |
2048 |
100000000000 |
0x800 |
Supplementary alignment |
The next issue is
how to judge the alignment. The number of reads with flags should be
matched to each other. The lower number would be from the mate 1
(R1), and the higher would be from the mate 2 (R2). Something must be
wrong with your alignment if those paired numbers don't match. Here
are some examples:
Both alignment:
Good alignment reveals
that both reads in the paired reads are successfully aligned to the
reference. The difference between the two flags would be 64 mostly,
or 80, or 48.
65 - 000001000001
– This read is the first read (R1) of a pair. both reads are
aligned to the forward strand.
129 - 000010000001
- This read is the second read (R2) of a pair. Both reads are aligned
to the forward strand.
67 - 000001000001
– This read is the first read (R1) in a concordant pair. Both reads
are properly aligned to the forward strand.
131 - 000010000001
- This read is the second read (R2) in a concordant pair. Both reads
are properly aligned to the forward strand.
113 - 000001110001
- This is the first read (R1) of a pair. Both reads in pair were
aligned to reverse complementary strands of reference.
177 - 000010110001
- This is the second read (R2) of a pair. Both reads in pair were
aligned to reverse complementary strands of reference.
81 - 000001010001
- This is the first read (R1), and aligned to the reverse
complementary strand of reference. Both reads are aligned.
161 - 000010100001
- This is the second read (R2), and the other read (R1) aligned to
the reverse complementary strand of reference. Both reads are
aligned.
83 - 000001010011
- This is the first read (R1), and aligned to the reverse
complementary strand of reference. Both reads are properly aligned.
163 - 000010100011
- This is the second read (R2), and the other read (R1) aligned to
the reverse complementary strand of reference. Both reads are
properly aligned.
97 - 000001100001
- This is first read, its mate is flipped but this is forward. Both
reads are aligned.
145 - 000010010001
- This is second read. it is flipped but its mate is not. Both reads
are aligned.
90 - 000001100011
- This is first read, its mate is flipped but this is forward. Both
reads are properly aligned.
147 - 000010010011
- This is second read. it is flipped but its mate is not. Both reads
are properly aligned.
One alignment:
That reveals only one read is aligned ,and the other one is
unaligned.
69 - 000001000101
– The first read (R1) in pair. This read is unaligned, and the
other read is unaligned.
137 - 000010001001
– the second (R2) in pair. This read is aligned, and the other read
is unaligned.
73 - 000001001001
– The first read (R1) in pair. This read is aligned, and the other
read is unaligned.
133 - 000010000101
- The second read (R2) in pair. This read is unaligned, and the other
read is aligned.
No alignment:
Both of the reads of a pair is unaligned.
77 -
000001001101 - The first read (R1) in pair. Both of the reads are
unaligned.
141 - 000010001101
- The second read (R2) in pair. Both of the reads are unaligned.
Multiple
alignment: It is also known as ambiguous alignment. The sum of
flags is always more than 300 because the flag contains the bit of
0x100 (256). Here, I provide two examples.
First one is two
paired-end alignments for a pair aligned to the identical regions
with different chromosomal positions, the sum of the flags are 99
(64+32+2+1) vs. 147 (128+16+2+1) and 355 (256+64+32+2+1) vs. 403
(256+128+16+2+1). The first alignment might be real, and the second
one may not be real because the reads of this pair are mapped into
different chromosome.
R0230412_0118:1:1102:18601:110990#ATGTCA 99 1 147890 3 18M2I30M = 147961 121 GAGCAAGCCCCTATCTAGAAAAAAAAAAAATGTCAGTGAAGATGTGGAGG bbbeeeeeggggghiiiiiiiihiiiiiid]\V\Zd^^^^\^ZaM^_Z`` AS:i:-11 XN:i:0 XM:i:0 XO:i:1 XG:i:2 NM:i:2 MD:Z:48 YT:Z:UU XS:A:- NH:i:2 CC:Z:= CP:i:748199 HI:i:0
R0230412_0118:1:1102:18601:110990#ATGTCA 147 1 147961 3 50M = 147890 -121 GGTGGGAACATAAAATTGTGTAACCATTTTGTTTGGGTATTTCTTTTCTT iiiiihihiiiihihiiihhghgdgeghiihiiiiiigggggeeeeebbb AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:- NH:i:2 CC:Z:= CP:i:748270 HI:i:0
R0230412_0118:1:1102:18601:110990#ATGTCA 355 1 748199 3 18M2I30M = 748270 121 GAGCAAGCCCCTATCTAGAAAAAAAAAAAATGTCAGTGAAGATGTGGAGG bbbeeeeeggggghiiiiiiiihiiiiiid]\V\Zd^^^^\^ZaM^_Z`` AS:i:-11 XN:i:0 XM:i:0 XO:i:1 XG:i:2 NM:i:2 MD:Z:48 YT:Z:UU XS:A:- NH:i:2 HI:i:1
R0230412_0118:1:1102:18601:110990#ATGTCA 403 1 748270 3 50M = 748199 -121 GGTGGGAACATAAAATTGTGTAACCATTTTGTTTGGGTATTTCTTTTCTT iiiiihihiiiihihiiihhghgdgeghiihiiiiiigggggeeeeebbb AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:- NH:i:2 HI:i:1
However, multiple
alignment sometimes make sense. The second example of multiple
alignment comes from RNA splicing. The sum of the flags are 83 vs.
163, 339 vs. 419, and 353 vs. 401.
#splicing 1
R0230412_0118:1:2307:9799:151900#ATGTCA 83 17 7626607 1 8M138N41M149N1M = 7626471 -473 AAGGCCACCTCAGCCACCTTTTTGATTATTTTGGAACCAATCCCTTGCCC hhffdeahgdhhhhfgihfiiihhhhfiiiihfffghfggegeeca\Pa_ AS:i:-2 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:47A2 YT:Z:UU XS:A:- NH:i:3 CC:Z:= CP:i:7626607 HI:i:0
R0230412_0118:1:2307:9799:151900#ATGTCA 163 17 7626471 1 50M = 7626607 473 TGCCTCTCCTTGAAAGCAGAAGAAGTGCCAGCCCTCAGCTTCCGTCAGAT _abeeceegfcg^``egadbggdbdbefggfdfhhfhhddgdhhgeff^c AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:- NH:i:3 CC:Z:= CP:i:7626471 HI:i:0
#splicing 2
R0230412_0118:1:2307:9799:151900#ATGTCA 339 17 7626607 1 8M138N41M349N1M = 7626471 -673 AAGGCCACCTCAGCCACCTTTTTGATTATTTTGGAACCAATCCCTTGCCC hhffdeahgdhhhhfgihfiiihhhhfiiiihfffghfggegeeca\Pa_ AS:i:-2 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:47A2 YT:Z:UU XS:A:- NH:i:3 CC:Z:= CP:i:7626607 HI:i:1
R0230412_0118:1:2307:9799:151900#ATGTCA 419 17 7626471 1 50M = 7626607 673 TGCCTCTCCTTGAAAGCAGAAGAAGTGCCAGCCCTCAGCTTCCGTCAGAT _abeeceegfcg^``egadbggdbdbefggfdfhhfhhddgdhhgeff^c AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:- NH:i:3 CC:Z:= CP:i:7626471 HI:i:1
#splicing 3
R0230412_0118:1:2307:9799:151900#ATGTCA 339 17 7626607 1 8M138N41M569N1M = 7626471 -893 AAGGCCACCTCAGCCACCTTTTTGATTATTTTGGAACCAATCCCTTGCCC hhffdeahgdhhhhfgihfiiihhhhfiiiihfffghfggegeeca\Pa_ AS:i:-2 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:47A2 YT:Z:UU XS:A:- NH:i:3 HI:i:2
R0230412_0118:1:2307:9799:151900#ATGTCA 419 17 7626471 1 50M = 7626607 893 TGCCTCTCCTTGAAAGCAGAAGAAGTGCCAGCCCTCAGCTTCCGTCAGAT _abeeceegfcg^``egadbggdbdbefggfdfhhfhhddgdhhgeff^c AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XS:A:- NH:i:3 HI:i:2
the flags of
single-end alignment
The sum of flags for single-end alignment is always the even.
Decimal |
Binary |
Hexadecimal |
Description |
0 |
000000000000 |
0x0 |
This read is aligned to the forward strand of reference |
4 |
000000000100 |
0x4 |
This read is not aligned. |
16 |
000000010000 |
0x10 |
This read is aligned to the reverse-complementary strand of
reference. |
The other fields
AS:i:<N>:Alignment
score generated by aligner. The higher <N> would be better. It
is usually more than 0, but It can be negative. It is only present
for an aligned read.
NM:i:<N>,
XM:i:<N>, and MD:Z:<N>:
the edit distance, the number and positions of mismatches. Both of
the fields are often combined with CIGAR string for analyzing
mutations including SNPs/insert/delete. For example:The read length
is 50 bases.
All matches |
CIGAR=50M,
XM:i:0
NM:i:0
MD:Z:50 |
All bases of the read matched to the reference. |
One SNP |
CIGAR=50M
XM:i:1
NM:i:1
MD:Z:19C30 |
One SNP site (The base is C at the reference) is at the 20th
base of the read. |
across an intron |
CIGAR=21M659N29M
XM:i:0
NM:i:0
MD:Z:50 |
All bases matched to the reference. But the read cross an
intron on the DNA reference. The regions with 21 and 29 bases
aligned to two exon, respectively. The intron length is 659 bases. |
One insertion, and one SNP |
CIGAR=45M1I4M
XM:i:1
NM:i:2MD:Z:47G1 |
There are two mismatched bases in the read. Of the read, the
first 45 bases aligned to the reference, and the 46th
base is insertion, and the 47-50th bases aligned to the reference.
One SNP (G in reference) at the 49th base in the read
is detected. |
One insertion |
CIGAR=12M2I36M
XM:i:0
NM:i:2MD:Z:48 |
There are two mismatched bases in the read. Of the read, the
first 12 bases aligned to the reference, and the 13-14th
bases are insertion, and the last 36 bases aligned to the
reference. |
One deletion |
CIGAR=15M2D35M
XM:i:0
NM:i:2MD:Z:15^AG35 |
There are two mismatched bases in the read. Of the read, the
first 15 bases aligned to the reference, and the two bases after
15 th are deletion, and the last 35 bases aligned to the
reference. |
One deletion, and one SNP |
CIGAR=13M1D37M
XM:i:1
NM:i:2MD:Z:0A12^A37 |
There are two mismatched bases in the read. Of the read, the
first base is SNP (A in reference), and the 2-13th bases are
aligned, and one base after the 13th base is deletion,
and the last 37 bases aligned to reference. |
CC:i:<N>
and CP:i:<N>: Chromosome name and offset of
next alignment, '=' if on the same chr.
HI:i:<N>:
hit index (increasing from from 0 to NH-1). The best hit does not
have to be the first one.
Writing date:
2014.09.23, 2015.02.04
No comments:
Post a Comment