fasta

是一种基于文本用于表示核酸序列或多肽序列的格式。其中核酸或复基酸均以单个字母来表示,且允许在序列前添加序列名及注释

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

特征:2行,id行和序列行。

fastq

是一种存储了生物序列以及相应的质量评价的文本格式。测序的原始序列

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

特征:4行为一个单位

gff/gtf

gff(General Feature Format)

记录序列中转录起始位点、基因、外显子、内含子等组成元件在染色体中的位置信息
现在用得比较多的是第3版,即gff3

Position index Position name Description
1 seqid 序列的编号。通常是染色体或者Contig ID,比如:Chr01 或者 scaffold_1
2 source 注释的来源。一般会是预测用软件工具或者数据库
3 type The feature type name, like "gene" or "exon". In a well structured GFF file, all the children features always follow their parents in a single block (so all exons of a transcript are put after their parent "transcript" feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the standards released by the Sequence Ontology Project.
4 start Genomic start of the feature, with a 1-base offset. This is in contrast with other 0-offset half-open sequence formats, like BED.
5 end Genomic end of the feature, with a 1-base offset. This is the same end coordinate as it is in 0-offset half-open sequence formats, like BED.[citation needed]
6 score Numeric value that generally indicates the confidence of the source in the annotated feature. A value of "." (a dot) is used to define a null value.
7 strand Single character that indicates the strand of the feature. This can be "+" (positive, or 5'->3'), "-", (negative, or 3'->5'), "." (undetermined), or "?" for features with relevant but unknown strands.
8 phase phase of CDS features; it can be either one of 0, 1, 2 (for CDS features) or "." (for everything else). See the section below for a detailed explanation.
9 attributes A list of tag-value pairs separated by a semicolon with additional information about the feature.

##gff-version 3
# This is a test sample
scaffold625	maker	gene	337818	343277	.	+	.	ID=CLUHARG00000005458;Name=TUBB3_2
scaffold625	maker	mRNA	337818	343277	.	+	.	ID=CLUHART00000008717;Parent=CLUHARG00000005458
scaffold625	maker	tss	337916	337918	.	+	.	ID=CLUHART00000008717:tss;Parent=CLUHART00000008717
scaffold625	maker	start_codon	337916	337918	.	+	.	ID=CLUHART00000008717:start;Parent=CLUHART00000008717
scaffold625	maker	CDS	337915	337971	.	+	0	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	maker	CDS	340733	340841	.	+	0	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	maker	CDS	341518	341628	.	+	2	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	maker	CDS	341964	343033	.	+	2	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	maker	stop_codon	343031	343033	.	+	.	ID=CLUHART00000008717:stop;Parent=CLUHART00000008717
scaffold625	maker	exon	337818	337971	.	+	.	ID=CLUHART00000008717:exon1;Parent=CLUHART00000008717
scaffold625	maker	exon	340733	340841	.	+	.	ID=CLUHART00000008717:exon2;Parent=CLUHART00000008717
scaffold625	maker	exon	341518	341628	.	+	.	ID=CLUHART00000008717:exon3;Parent=CLUHART00000008717
scaffold625	maker	exon	341964	343277	.	+	.	ID=CLUHART00000008717:exon4;Parent=CLUHART00000008717

gtf(General Transfer Format)

现在用得比较多的是第2版,即gtf2

可以用cuff1inks里的gffreadi命令互相转换格式

格式:

attributes 含义
gene_id "value"; 表示转录本在基因组上的基因座的唯一ID。gene_id与value值用空格分开,如果值为空,则表示没有对应的基因。
transcript_id "value"; 预测转录本的唯一ID。transcript_id与value值用空格分开,空表示没有转录本。
140    Twinscan    inter         5141     8522     .    -    .    gene_id ""; transcript_id "";
140    Twinscan    inter_CNS     8523     9711     .    -    .    gene_id ""; transcript_id "";
140    Twinscan    inter         9712     13182    .    -    .    gene_id ""; transcript_id "";
140    Twinscan    3UTR          65149    65487    .    -    .    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    3UTR          66823    66992    .    -    .    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    stop_codon    66993    66995    .    -    0    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    CDS           66996    66999    .    -    1    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    intron_CNS    70103    70151    .    -    .    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    CDS           70207    70294    .    -    2    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    CDS           71696    71807    .    -    0    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    start_codon   71805    71806    .    -    0    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    start_codon   73222    73222    .    -    2    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    CDS           73222    73222    .    -    0    gene_id "140.000"; transcript_id "140.000.1";
140    Twinscan    5UTR          73223    73504    .    -    .    gene_id "140.000"; transcript_id "140.000.1";

转换

#from GFF to GTF
gffread example.gff3 -T -o new.gtf
#from GTF to GFF
gffread example.gtf -o- > new.gff3

但是,使用gffread进行格式转换时容易导致信息的丢失,因此更推荐使用另一种软件AGAT。AGAT不仅可以进行GFF/GTF文件的格式转换,还能够检查修复、填充GFF和GTF的缺失信息。

AGAT

image