Assemble and annotate a chloroplast genome

What is genome assembly?

What’s in this tutorial?

What’s not in this tutorial?

Get data

Check read quality

What summary statistics would be useful to look at?

This will depend on the aim of your analysis, but usually:

  • Sequencing depth (the number of reads covering each base position; also called “coverage”). Higher depth is usually better, but at very high depths it may be better to subsample the reads, as errors can swamp the assembly graph.

  • Sequencing quality (the quality score indicates the probability that a base call is correct). You may trim or filter reads on quality. Phred quality scores are logarithmic: Phred quality 10 = 90% chance that the base call is correct; Phred quality 20 = 99% chance that the base call is correct.

  • Read lengths (read length histogram, and read length vs. quality plots). Your analysis or assembly may need reads of a certain length.
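The three statistics above can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in the tutorial: it assumes 4-line FASTQ records with Phred+33 quality encoding, the function names are hypothetical, and the toy reads and genome size are made up. Mean read quality is taken as a simple arithmetic mean of Phred scores, which is a simplification.

```python
def phred_to_accuracy(q):
    """Probability that a base call is correct, given Phred score q."""
    return 1.0 - 10 ** (-q / 10)


def parse_fastq(text):
    """Yield (name, sequence, quality string) from 4-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        yield lines[i][1:], lines[i + 1], lines[i + 3]


def summarise(text, genome_size, offset=33):
    """Return read count, length, quality and estimated depth.

    Assumes Phred+33 encoding (offset=33). Depth is estimated as
    total sequenced bases divided by the expected genome size.
    """
    lengths = []
    mean_quals = []
    for _name, seq, qual in parse_fastq(text):
        lengths.append(len(seq))
        mean_quals.append(sum(ord(c) - offset for c in qual) / len(qual))
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "total_bases": total,
        "mean_length": total / len(lengths),
        "mean_read_quality": sum(mean_quals) / len(mean_quals),
        "estimated_depth": total / genome_size,
    }


# Hypothetical toy data: two reads and a 6 bp "genome".
example = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\n5555\n"
stats = summarise(example, genome_size=6)
print(stats)
print(phred_to_accuracy(10), phred_to_accuracy(20))
```

For real data, a dedicated read-quality tool will report these values along with histograms and per-position plots; the sketch only shows where the numbers come from.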

Assemble reads

[Figure: assembly graph]

What is your interpretation of this assembly graph?

One interpretation is that this represents the typical circular chloroplast structure: a large single-copy region (the node of around 78,000 bp) is connected to the inverted repeat (a node of around 28,000 bp), which is connected to the small single-copy region (of around 11,000 bp). In the graph, each end loop is a single-copy region (either large or small), and the centre bar is the collapsed inverted repeat, which should have about twice the sequencing depth of the single-copy regions.
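As a worked check of this interpretation, the sketch below rebuilds the expected genome length from the three node sizes, counting the collapsed inverted repeat twice. The node lengths are the approximate values quoted in the answer; the function names and the single-copy depth of 100 are hypothetical.

```python
def circular_genome_length(large_single_copy, inverted_repeat, small_single_copy):
    """Length of the full circle: each single-copy region once,
    the inverted repeat twice (it is collapsed into one node in the graph)."""
    return large_single_copy + small_single_copy + 2 * inverted_repeat


def expected_node_depths(single_copy_depth):
    """Expected relative depth per node if the repeat is collapsed."""
    return {
        "large_single_copy": single_copy_depth,
        "inverted_repeat": 2 * single_copy_depth,
        "small_single_copy": single_copy_depth,
    }


# Approximate node sizes from the assembly graph described above.
total = circular_genome_length(78_000, 28_000, 11_000)
print(total)  # 145000

# Hypothetical single-copy depth of 100x, for illustration only.
print(expected_node_depths(100))
```

If the repeat node in your own graph shows roughly double the depth of the two loops, that supports reading it as a collapsed inverted repeat.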

Polish assembly

How does the short-read Pilon-polished assembly compare to the unpolished flye-assembly.fasta?

This will depend on the settings, but as an example: your polished assembly might be about 10-15 Kbp longer. Nanopore reads can have homopolymer deletions (a run of AAAA may be interpreted as AAA), so the more accurate Illumina reads may correct these parts of the long-read assembly. In the Changes file, there may be many cases showing a supposed deletion (represented by a dot) being corrected to a base.
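One way to see this pattern is to tally the kinds of edits listed in the Changes file. The sketch below is illustrative only. It assumes each line has four whitespace-separated fields (original coordinate, polished coordinate, original sequence, new sequence) with a dot standing for "no sequence"; check this against your own file before relying on it. The example lines and function names are hypothetical.

```python
from collections import Counter


def classify_change(old_seq, new_seq):
    """Classify one change from the assembly's point of view."""
    if old_seq == ".":
        return "insertion"      # base(s) added to the assembly
    if new_seq == ".":
        return "deletion"       # base(s) removed from the assembly
    return "substitution"


def summarise_changes(lines):
    """Count change types and the net change in assembly length."""
    counts = Counter()
    net_length_change = 0
    for line in lines:
        if not line.strip():
            continue
        # Assumed column layout: old_pos new_pos old_seq new_seq
        _old_pos, _new_pos, old_seq, new_seq = line.split()
        counts[classify_change(old_seq, new_seq)] += 1
        old_len = 0 if old_seq == "." else len(old_seq)
        new_len = 0 if new_seq == "." else len(new_seq)
        net_length_change += new_len - old_len
    return counts, net_length_change


# Hypothetical example lines, not real tutorial output.
example_lines = [
    "contig_1:105 contig_1_pilon:105 . A",
    "contig_1:230 contig_1_pilon:231 . TT",
    "contig_1:400 contig_1_pilon:403 G C",
    "contig_1:512 contig_1_pilon:515 AC .",
]
counts, net_length_change = summarise_changes(example_lines)
print(dict(counts), net_length_change)
```

A large excess of insertions over deletions would be consistent with the polished assembly being longer than the unpolished one.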

View reads


What are some reasons that the read coverage may vary across the reference genome?

There may be lots of reasons. Some possibilities:

1/ In areas of high read coverage: this region may be a collapsed repeat.

2/ In areas of low or no coverage: this region may be difficult to sequence; or, this region may be a misassembly.
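To make the two cases concrete, the sketch below flags windows whose mean depth is far above or below the typical depth. It assumes you already have a per-base depth list (for example, exported from a read-mapping tool); the window size, the thresholds and the depth values are hypothetical choices for illustration.

```python
import statistics


def window_means(depths, window):
    """Mean depth of consecutive, non-overlapping windows."""
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


def flag_windows(depths, window=4, high=1.8, low=0.2):
    """Return (window start, label) for unusually deep or shallow windows.

    A window is "high" if its mean is at least `high` times the median
    window mean (possible collapsed repeat), and "low" if it is at most
    `low` times the median (hard-to-sequence region or misassembly).
    """
    means = window_means(depths, window)
    typical = statistics.median(means)
    if typical == 0:
        return []  # no meaningful baseline to compare against
    flags = []
    for index, mean_depth in enumerate(means):
        if mean_depth >= high * typical:
            flags.append((index * window, "high"))
        elif mean_depth <= low * typical:
            flags.append((index * window, "low"))
    return flags


# Hypothetical depths: mostly 50x, one doubled region, one gap.
depths = [50] * 8 + [100] * 4 + [50] * 4 + [0] * 4 + [50] * 4
flags = flag_windows(depths)
print(flags)  # [(8, 'high'), (16, 'low')]
```

A flagged window is only a prompt to look at the reads in a genome viewer; the thresholds here are arbitrary and would need tuning for real data.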


What are the differences between the Nanopore and the Illumina reads?

Nanopore reads are longer than Illumina reads and have a higher error rate.

Annotate genome


Why might there be several annotations over the same genome region?

One reason is that these are predictions from different tools, such as BLAT or HMMER.

Repeat with new data

Tutorial summary

What were the main steps in this tutorial?

Get data → Assemble → Polish → Annotate

What common file types were used or created?

fastq, fasta, bam, gff3

input_reads.fastq → assembly.fasta → mapped_short_reads_to_assembly.bam and polished_assembly.fasta → annotations.gff3

What’s next?