fasta

是一种基于文本用于表示核酸序列或多肽序列的格式。其中核酸或复基酸均以单个字母来表示，且允许在序列前添加序列名及注释

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

特征：2行，id行和序列行。

id行以“>”开头，有时候会包含注释信息
序列行一个字母表示一个碱基（A、T、C、G、N）/氨基酸（20种常见氨基酸）

fastq

是一种存储了生物序列以及相应的质量评价的文本格式。测序的原始序列

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

特征：4行为一个单位

第一行，id行，以@开头，记录必要信息
第二行，序列行
第三行，附加信息行(一般就一个加号)
第四行，碱基质量行

gff/gtf

gff(General Feature Format)

记录序列中转录起始位点、基因、外显子、内含子等组成元件在染色体中的位置信息
现在用得比较多的是第3版，即gff3

Position index	Position name	Description
1	seqid	序列的编号。通常是染色体或者Contig ID，比如：Chr01 或者 scaffold_1
2	source	注释的来源。一般会是预测用软件工具或者数据库
3	type	The feature type name, like "gene" or "exon". In a well structured GFF file, all the children features always follow their parents in a single block (so all exons of a transcript are put after their parent "transcript" feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the standards released by the Sequence Ontology Project.
4	start	Genomic start of the feature, with a 1-base offset. This is in contrast with other 0-offset half-open sequence formats, like BED.
5	end	Genomic end of the feature, with a 1-base offset. This is the same end coordinate as it is in 0-offset half-open sequence formats, like BED.[citation needed]
6	score	Numeric value that generally indicates the confidence of the source in the annotated feature. A value of "." (a dot) is used to define a null value.
7	strand	Single character that indicates the strand of the feature. This can be "+" (positive, or 5'->3'), "-", (negative, or 3'->5'), "." (undetermined), or "?" for features with relevant but unknown strands.
8	phase	phase of CDS features; it can be either one of 0, 1, 2 (for CDS features) or "." (for everything else). See the section below for a detailed explanation.
9	attributes	A list of tag-value pairs separated by a semicolon with additional information about the feature.


##gff-version 3
# This is a test sample
scaffold625	maker	gene	337818	343277	.	+	.	ID=CLUHARG00000005458;Name=TUBB3_2
scaffold625	maker	mRNA	337818	343277	.	+	.	ID=CLUHART00000008717;Parent=CLUHARG00000005458
scaffold625	maker	tss	337916	337918	.	+	.	ID=CLUHART00000008717:tss;Parent=CLUHART00000008717
scaffold625	maker	start_codon	337916	337918	.	+	.	ID=CLUHART00000008717:start;Parent=CLUHART00000008717
scaffold625	maker	CDS	337915	337971	.	+	0	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	maker	CDS	340733	340841	.	+	0	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	maker	CDS	341518	341628	.	+	2	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	maker	CDS	341964	343033	.	+	2	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	maker	stop_codon	343031	343033	.	+	.	ID=CLUHART00000008717:stop;Parent=CLUHART00000008717
scaffold625	maker	exon	337818	337971	.	+	.	ID=CLUHART00000008717:exon1;Parent=CLUHART00000008717
scaffold625	maker	exon	340733	340841	.	+	.	ID=CLUHART00000008717:exon2;Parent=CLUHART00000008717
scaffold625	maker	exon	341518	341628	.	+	.	ID=CLUHART00000008717:exon3;Parent=CLUHART00000008717
scaffold625	maker	exon	341964	343277	.	+	.	ID=CLUHART00000008717:exon4;Parent=CLUHART00000008717

gtf(General Transfer Format)

现在用得比较多的是第2版，即gtf2

可以用cuff1inks里的gffreadi命令互相转换格式

格式：

seqname：序列名称。通常是染色体或者Contig ID，比如：Chr01 或者 scaffold_1
source：注释的来源。一般会是预测用软件工具或者数据库
feature：注释信息类型（基因结构），这列信息是必须的，至少包括比如CDS、start_codon、stop_codon，还可以包含5UTR、3UTR、inter、inter_CNS、intron_CNS、exon、mRNA等等
start：上述feature起始位点。从1 起始计数
end：上述feature结束位点
score：表示feature的存在和坐标的置信度，可以是一个浮点数或者整数，"." 表示为空，就是不需要
strand：该feature位于参考序列的正链（+）或者负链（+）
frame：0 表示阅读框的第一个完整密码子位于最5’端，1 表示在第一个完整密码子之前有一个额外的碱基，2 表示在第一整个密码子之前还有两个额外的碱基。注意，frame不是CDS长度除以3的余数（CDS mod 3）。如果链是“-”，则该区域的第一个碱基的值为end，因为对应的编码区域将在反向链上从end到start。
attributes ：应具有的格式为attribute_name “attribut_value”；，每个属性必须以分号结尾并且与下一个属性之间以空格分隔，并且属性的值应该用双引号包围。并且必须包含以下两个属性：

attributes	含义
gene_id "value";	表示转录本在基因组上的基因座的唯一ID。gene_id与value值用空格分开，如果值为空，则表示没有对应的基因。
transcript_id "value";	预测转录本的唯一ID。transcript_id与value值用空格分开，空表示没有转录本。

140    Twinscan    inter         5141     8522     .    -    .    gene_id ""; transcript_id "";
140    Twinscan    inter_CNS     8523     9711     .    -    .    gene_id ""; transcript_id "";
140    Twinscan    inter         9712     13182    .    -    .    gene_id ""; transcript_id "";
140    Twinscan    3UTR          65149    65487    .    -    .    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    3UTR          66823    66992    .    -    .    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    stop_codon    66993    66995    .    -    0    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    CDS           66996    66999    .    -    1    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    intron_CNS    70103    70151    .    -    .    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    CDS           70207    70294    .    -    2    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    CDS           71696    71807    .    -    0    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    start_codon   71805    71806    .    -    0    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    start_codon   73222    73222    .    -    2    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    CDS           73222    73222    .    -    0    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    5UTR          73223    73504    .    -    .    gene_id "140.000"; transcript_id "140.000.1";

转换

#from GFF to GTF
gffread example.gff3 -T -o new.gtf
#from GTF to GFF
gffread example.gtf -o- > new.gff3

但是，使用gffread进行格式转换时容易导致信息的丢失，因此更推荐使用另一种软件AGAT。AGAT不仅可以进行GFF/GTF文件的格式转换，还能够检查修复、填充GFF和GTF的缺失信息。

AGAT

生物信息学常见数据格式

fasta

fastq

gff/gtf

gff(General Feature Format)

gtf(General Transfer Format)