fasta
是一种基于文本用于表示核酸序列或多肽序列的格式。其中核酸或复基酸均以单个字母来表示,且允许在序列前添加序列名及注释
>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*
特征:2行,id行和序列行。
-
id行以“>”开头,有时候会包含注释信息
-
序列行一个字母表示一个碱基(A、T、C、G、N)/氨基酸(20种常见氨基酸)
fastq
是一种存储了生物序列以及相应的质量评价的文本格式。测序的原始序列
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
特征:4行为一个单位
-
第一行,id行,以@开头,记录必要信息
-
第二行,序列行
-
第三行,附加信息行(一般就一个加号)
-
第四行,碱基质量行
gff/gtf
gff(General Feature Format)
记录序列中转录起始位点、基因、外显子、内含子等组成元件在染色体中的位置信息
现在用得比较多的是第3版,即gff3
Position index | Position name | Description |
---|---|---|
1 | seqid | 序列的编号。通常是染色体或者Contig ID,比如:Chr01 或者 scaffold_1 |
2 | source | 注释的来源。一般会是预测用软件工具或者数据库 |
3 | type | The feature type name, like "gene" or "exon". In a well structured GFF file, all the children features always follow their parents in a single block (so all exons of a transcript are put after their parent "transcript" feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the standards released by the Sequence Ontology Project. |
4 | start | Genomic start of the feature, with a 1-base offset. This is in contrast with other 0-offset half-open sequence formats, like BED. |
5 | end | Genomic end of the feature, with a 1-base offset. This is the same end coordinate as it is in 0-offset half-open sequence formats, like BED.[citation needed] |
6 | score | Numeric value that generally indicates the confidence of the source in the annotated feature. A value of "." (a dot) is used to define a null value. |
7 | strand | Single character that indicates the strand of the feature. This can be "+" (positive, or 5'->3'), "-", (negative, or 3'->5'), "." (undetermined), or "?" for features with relevant but unknown strands. |
8 | phase | phase of CDS features; it can be either one of 0, 1, 2 (for CDS features) or "." (for everything else). See the section below for a detailed explanation. |
9 | attributes | A list of tag-value pairs separated by a semicolon with additional information about the feature. |
##gff-version 3
# This is a test sample
scaffold625 maker gene 337818 343277 . + . ID=CLUHARG00000005458;Name=TUBB3_2
scaffold625 maker mRNA 337818 343277 . + . ID=CLUHART00000008717;Parent=CLUHARG00000005458
scaffold625 maker tss 337916 337918 . + . ID=CLUHART00000008717:tss;Parent=CLUHART00000008717
scaffold625 maker start_codon 337916 337918 . + . ID=CLUHART00000008717:start;Parent=CLUHART00000008717
scaffold625 maker CDS 337915 337971 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 340733 340841 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 341518 341628 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 341964 343033 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker stop_codon 343031 343033 . + . ID=CLUHART00000008717:stop;Parent=CLUHART00000008717
scaffold625 maker exon 337818 337971 . + . ID=CLUHART00000008717:exon1;Parent=CLUHART00000008717
scaffold625 maker exon 340733 340841 . + . ID=CLUHART00000008717:exon2;Parent=CLUHART00000008717
scaffold625 maker exon 341518 341628 . + . ID=CLUHART00000008717:exon3;Parent=CLUHART00000008717
scaffold625 maker exon 341964 343277 . + . ID=CLUHART00000008717:exon4;Parent=CLUHART00000008717
gtf(General Transfer Format)
现在用得比较多的是第2版,即gtf2
可以用cuff1inks里的gffreadi命令互相转换格式
格式:
-
seqname
: 序列名称。通常是染色体或者Contig ID,比如:Chr01 或者 scaffold_1 -
source
:注释的来源。一般会是预测用软件工具或者数据库 -
feature
:注释信息类型(基因结构),这列信息是必须的,至少包括比如CDS、start_codon、stop_codon,还可以包含5UTR、3UTR、inter、inter_CNS、intron_CNS、exon、mRNA等等 -
start
:上述feature
起始位点。从1 起始计数 -
end
:上述feature
结束位点 -
score
:表示feature
的存在和坐标的置信度,可以是一个浮点数或者整数,"." 表示为空,就是不需要 -
strand
:该feature
位于参考序列的正链(+)或者负链(+) -
frame
:0 表示阅读框的第一个完整密码子位于最5’端,1 表示在第一个完整密码子之前有一个额外的碱基,2 表示在第一整个密码子之前还有两个额外的碱基。注意,frame
不是CDS长度除以3的余数(CDS mod 3)。如果链是“-”,则该区域的第一个碱基的值为end
,因为对应的编码区域将在反向链上从end
到start
。 -
attributes
:应具有的格式为attribute_name “attribut_value”;
,每个属性必须以分号结尾并且与下一个属性之间以空格分隔,并且属性的值应该用双引号包围。并且必须包含以下两个属性:
attributes | 含义 |
---|---|
gene_id "value"; | 表示转录本在基因组上的基因座的唯一ID。gene_id与value值用空格分开,如果值为空,则表示没有对应的基因。 |
transcript_id "value"; | 预测转录本的唯一ID。transcript_id与value值用空格分开,空表示没有转录本。 |
140 Twinscan inter 5141 8522 . - . gene_id ""; transcript_id "";
140 Twinscan inter_CNS 8523 9711 . - . gene_id ""; transcript_id "";
140 Twinscan inter 9712 13182 . - . gene_id ""; transcript_id "";
140 Twinscan 3UTR 65149 65487 . - . gene_id "140.000"; transcript_id "140.000.1";
140 Twinscan 3UTR 66823 66992 . - . gene_id "140.000"; transcript_id "140.000.1";
140 Twinscan stop_codon 66993 66995 . - 0 gene_id "140.000"; transcript_id "140.000.1";
140 Twinscan CDS 66996 66999 . - 1 gene_id "140.000"; transcript_id "140.000.1";
140 Twinscan intron_CNS 70103 70151 . - . gene_id "140.000"; transcript_id "140.000.1";
140 Twinscan CDS 70207 70294 . - 2 gene_id "140.000"; transcript_id "140.000.1";
140 Twinscan CDS 71696 71807 . - 0 gene_id "140.000"; transcript_id "140.000.1";
140 Twinscan start_codon 71805 71806 . - 0 gene_id "140.000"; transcript_id "140.000.1";
140 Twinscan start_codon 73222 73222 . - 2 gene_id "140.000"; transcript_id "140.000.1";
140 Twinscan CDS 73222 73222 . - 0 gene_id "140.000"; transcript_id "140.000.1";
140 Twinscan 5UTR 73223 73504 . - . gene_id "140.000"; transcript_id "140.000.1";
转换
#from GFF to GTF
gffread example.gff3 -T -o new.gtf
#from GTF to GFF
gffread example.gtf -o- > new.gff3
但是,使用gffread进行格式转换时容易导致信息的丢失,因此更推荐使用另一种软件AGAT。AGAT不仅可以进行GFF/GTF文件的格式转换,还能够检查修复、填充GFF和GTF的缺失信息。
AGAT