Development guide
Introduction
This guide is for developers who want to use or contribute to the MolTools Java library.
It first describes the central Sequence and SequenceDB data structures, and then the sequence IO system.
The Sequence structure
The basic sequence data structure used in the MolTools library is defined by the Sequence interface, which defines the following methods:
- String getName()
- Returns the name of the sequence
- int length()
- Returns the length of the sequence (for a DNA sequence, the number of nucleotides)
- String seqString()
- Returns the default String representation of the sequence, using the standard symbols for individual subunits (nucleotides)
- String subsequence(int start, int end)
- Returns the sub-sequence from start to end (both inclusive). Positions are 1-based: the first position of the sequence is position 1
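The 1-based, inclusive indexing of subsequence differs from Java's own String.substring, so a small sketch may help. The interface below is a stand-in written from the method list above, and StringSequence is a hypothetical implementation invented for illustration; neither is taken from the MolTools sources.

```java
import java.util.Objects;

/** Stand-in for the Sequence interface, written from the method list above. */
interface Sequence {
    String getName();
    int length();
    String seqString();
    String subsequence(int start, int end); // 1-based, both ends inclusive
}

/** Hypothetical implementation backed by a plain String (not a MolTools class). */
final class StringSequence implements Sequence {
    private final String name;
    private final String data;

    StringSequence(String name, String data) {
        this.name = Objects.requireNonNull(name);
        this.data = Objects.requireNonNull(data);
    }

    @Override public String getName() { return name; }
    @Override public int length() { return data.length(); }
    @Override public String seqString() { return data; }

    @Override
    public String subsequence(int start, int end) {
        if (start < 1 || end > data.length() || start > end) {
            throw new IndexOutOfBoundsException("invalid range " + start + ".." + end);
        }
        // The 1-based inclusive start becomes a 0-based index (start - 1);
        // the 1-based inclusive end equals Java's 0-based exclusive end.
        return data.substring(start - 1, end);
    }
}
```

With this sketch, calling subsequence(1, 3) on a sequence holding "ACGTAC" yields the first three symbols, and subsequence(6, 6) yields only the last one.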
There are several implementations of this interface. To let the client select which implementation is created when reading sequence data, the SequenceBuilder interface is used.
A SequenceBuilder object is responsible for creating a particular type of sequence from a Map of sequence properties. The common properties are sequence name and sequence data, although other properties may be defined. The builder in use determines how the property values are used to create the Sequence object. The builder may also be supplied with a SequenceConverter object, which is responsible for converting the actual sequence data from one notation to another (such as removing brackets or parentheses, or changing between lower and upper case).
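The division of labour between builder and converter can be sketched as follows. This is only an illustration under stated assumptions: the method names build and convert, the property keys "name" and "sequence", and the classes NamedSequence, CleaningConverter and SimpleBuilder are all hypothetical, and the real MolTools signatures may differ.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical converter: rewrites raw sequence data into another notation. */
interface SequenceConverter {
    String convert(String rawData);
}

/** Hypothetical builder: turns a property map into a sequence object of type S. */
interface SequenceBuilder<S> {
    S build(Map<String, String> properties, SequenceConverter converter);
}

/** Minimal sequence value used only by this sketch. */
final class NamedSequence {
    final String name;
    final String data;

    NamedSequence(String name, String data) {
        this.name = name;
        this.data = data;
    }
}

/** Removes brackets and parentheses, then upper-cases the remaining symbols. */
final class CleaningConverter implements SequenceConverter {
    @Override
    public String convert(String rawData) {
        return rawData.replaceAll("[\\[\\]()]", "").toUpperCase(Locale.ROOT);
    }
}

/** Reads the assumed "name" and "sequence" keys and applies the optional converter. */
final class SimpleBuilder implements SequenceBuilder<NamedSequence> {
    @Override
    public NamedSequence build(Map<String, String> properties, SequenceConverter converter) {
        String name = properties.getOrDefault("name", "unnamed");
        String data = properties.getOrDefault("sequence", "");
        // A null converter means the data is used exactly as read.
        if (converter != null) {
            data = converter.convert(data);
        }
        return new NamedSequence(name, data);
    }
}
```

Keeping the converter separate from the builder means the same notation clean-up can be reused with any sequence implementation.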
The SequenceDB structure
Sets of sequences are generally represented by a SequenceDB object. Several types of DB exist, with different capabilities.
The simplest type must define the following methods:
- boolean containsSequence(Sequence seq)
- Check whether this DB contains 'seq'
- boolean isEmpty()
- Check whether this DB is empty
- Iterator iterator()
- Return an Iterator over the sequences in the DB
- int size()
- Return the size of the DB
- SequenceDB select(SequencePredicate p)
- Return a DB containing all sequences that match the predicate p
Some of these methods are redundant, since the same information can be obtained through the others (for example, both isEmpty() and size() could be derived from iterator()). Keeping them in the interface lets each implementation decide how to carry out the operation, which leaves room for performance optimization.
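A minimal sketch of such a read-only DB, assuming a list-backed store, is given below. To keep it short, sequences are represented as plain Strings, and the predicate method name matches is an assumption; ListBackedDB is a hypothetical class, not part of MolTools.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Hypothetical predicate; sequences are plain Strings in this sketch. */
interface SequencePredicate {
    boolean matches(String seq);
}

/** Stand-in for the simplest SequenceDB type, following the method list above. */
interface SequenceDB extends Iterable<String> {
    boolean containsSequence(String seq);
    boolean isEmpty();
    int size();
    SequenceDB select(SequencePredicate p);
}

/** Immutable, list-backed DB (illustrative only). */
final class ListBackedDB implements SequenceDB {
    private final List<String> seqs;

    ListBackedDB(List<String> seqs) {
        // Defensive copy so later changes to the caller's list do not leak in.
        this.seqs = Collections.unmodifiableList(new ArrayList<>(seqs));
    }

    @Override public boolean containsSequence(String seq) { return seqs.contains(seq); }

    // Both could be derived from iterator(), but the list answers them directly.
    @Override public boolean isEmpty() { return seqs.isEmpty(); }
    @Override public int size() { return seqs.size(); }

    @Override public Iterator<String> iterator() { return seqs.iterator(); }

    @Override
    public SequenceDB select(SequencePredicate p) {
        List<String> matching = new ArrayList<>();
        for (String s : seqs) {
            if (p.matches(s)) {
                matching.add(s);
            }
        }
        return new ListBackedDB(matching);
    }
}
```

Because select returns a new DB rather than a view, the result stays valid however the source DB is implemented.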
Other DB types allow the addition and removal of sequences, and access to sequences through an integer index or through the sequence name. When importing sequences, the SequenceDB implementation that is used is determined by a SequenceDBBuilder. In addition to constructing and populating a DB, this builder may perform other actions on the sequences, such as sorting or filtering. The DB builder is provided with a List of sequence data Maps, a SequenceBuilder, and (optionally) a SequenceConverter.
This lets a DB builder control the process tightly, while the default builder simply passes each sequence data Map to the sequence builder together with the converter and stores the resulting Sequence object in the DB.
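The difference between the default behaviour and a more controlling DB builder might look roughly like this. Everything here is a hypothetical stand-in: the DB is a plain List, sequences are Strings, the sequence builder and converter are modelled with java.util.function types, and the "sequence" property key is assumed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.UnaryOperator;

/** Illustrative DB-builder strategies; not MolTools code. */
final class DBBuilderSketch {
    private DBBuilderSketch() {}

    /** Stand-in sequence builder: reads the assumed "sequence" key, applies the optional converter. */
    static String buildSequence(Map<String, String> props, UnaryOperator<String> converter) {
        String data = props.getOrDefault("sequence", "");
        return converter == null ? data : converter.apply(data);
    }

    /** Default strategy: one sequence per data map, stored in input order. */
    static List<String> defaultBuild(
            List<Map<String, String>> records,
            BiFunction<Map<String, String>, UnaryOperator<String>, String> builder,
            UnaryOperator<String> converter) {
        List<String> db = new ArrayList<>();
        for (Map<String, String> record : records) {
            db.add(builder.apply(record, converter));
        }
        return db;
    }

    /** A more controlling strategy: drop empty sequences, then sort by length. */
    static List<String> filteredSortedBuild(
            List<Map<String, String>> records,
            BiFunction<Map<String, String>, UnaryOperator<String>, String> builder,
            UnaryOperator<String> converter) {
        List<String> db = defaultBuild(records, builder, converter);
        db.removeIf(String::isEmpty);
        db.sort(Comparator.comparingInt(String::length));
        return db;
    }
}
```

Both strategies receive the same three inputs; only what they do with the built sequences differs.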
The sequence IO system
The IO system is designed to allow reading sequence data from, and writing it to, a wide selection of file formats. The design is modular, so that individual components can be replaced to adjust how sequence data is processed.
In MolTools terms, the sequence IO task is to create a set of Sequence objects from a stream of data and, conversely, to provide a stream of data from a set of Sequence objects.
The set of sequences read or written can be represented as a Java Collection or as a MolTools SequenceDB object.
Reading sequences from file
As sequences are created by providing a builder with a Map of property values, the task of sequence input is to create a list of such maps. These may then be used directly to create the sequences with a provided builder, or be passed along with the builder to a DB builder for special processing.
The input data stream is read by a SequenceDBReader, which is supplied with a SequenceInputFormatter. In general, the formatter reads individual sequence records and creates a map of data to be sent to a SequenceBuilder. A simple reader thus opens the stream and calls the formatter repeatedly until all sequences have been read. A reader may also begin by reading a set of header data from the stream and then read the sequence records, creating the data maps from the header data as well as from the individual records. This system should be flexible enough to read most structured text-based file formats. A reader may, however, perform the read in some other fashion without using a formatter, which allows for other file formats as well.
A SequenceDBInputFormat object defines a file format by providing both a SequenceDBReader and a SequenceInputFormatter.
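The reader and formatter roles described above can be sketched for Fasta-style input as follows. This is not the MolTools implementation: the class FastaReadSketch, its method names, and the property keys "name" and "sequence" are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative reader/formatter pair for Fasta-style text; not MolTools code. */
final class FastaReadSketch {
    private FastaReadSketch() {}

    /** Formatter role: read one record into a property map, or return null at end of stream. */
    static Map<String, String> readRecord(BufferedReader in) throws IOException {
        String line = in.readLine();
        // Skip anything that precedes the next header line.
        while (line != null && !line.startsWith(">")) {
            line = in.readLine();
        }
        if (line == null) {
            return null;
        }
        Map<String, String> props = new LinkedHashMap<>();
        props.put("name", line.substring(1).trim());
        StringBuilder data = new StringBuilder();
        while (true) {
            in.mark(1);             // remember the position before peeking
            int next = in.read();
            if (next == -1) {
                break;              // end of stream
            }
            in.reset();             // un-read the peeked character
            if (next == '>') {
                break;              // next record starts; leave it for the next call
            }
            data.append(in.readLine().trim());
        }
        props.put("sequence", data.toString());
        return props;
    }

    /** Reader role: open the stream and call the formatter until no records remain. */
    static List<Map<String, String>> readAll(Reader source) {
        List<Map<String, String>> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            Map<String, String> record;
            while ((record = readRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    /** Convenience overload for in-memory text. */
    static List<Map<String, String>> readAll(String text) {
        return readAll(new StringReader(text));
    }
}
```

The resulting list of maps is the intermediate form that a sequence builder or DB builder would consume; a format with file-level header data would need a reader that parses the header before entering the record loop.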
The SequenceIO class provides a few utility methods to read sequences from file, using default implementations of the input format classes. Reading a Fasta-formatted file without any extra modification may thus be done in any of the following ways:
    import java.io.File;
    import java.io.FileInputStream;
    import java.util.Collection;
    // (MolTools imports omitted)

    // The input file
    File inputFile = new File("example.fas");
    // The sequence builder that creates the sequence type of interest
    SequenceBuilder builder = new SimpleDNASequenceBuilder();

    // Get a sequence DB with the default implementations of
    // SequenceDBBuilder and SequenceDBReader, and no SequenceConverter
    SequenceDB db = SequenceIO.getSequenceDB(
        new FileInputStream(inputFile), // InputStream
        null,                           // SequenceDBBuilder
        null,                           // SequenceDBReader
        new FastaFormatter(),           // SequenceInputFormatter
        null,                           // SequenceConverter
        builder);                       // SequenceBuilder

    // This is equivalent to:
    SequenceDB db1 = SequenceIO.getSequenceDB(
        new FileInputStream(inputFile),     // InputStream
        ListSequenceDB.getDefaultBuilder(), // The default SequenceDBBuilder
        SequenceIO.defaultDBIO,             // The default SequenceDBReader
        new FastaFormatter(),               // SequenceInputFormatter
        null,                               // SequenceConverter
        builder);                           // SequenceBuilder

    // Using a SequenceDBInputFormat for reader and formatter:
    SequenceDB db2 = SequenceIO.getSequenceDB(
        new FileInputStream(inputFile), // InputStream
        null,                           // SequenceDBBuilder
        new FastaDBFormat(),            // SequenceDBInputFormat
        null,                           // SequenceConverter
        builder);                       // SequenceBuilder

    // Or, to just get a Collection:
    Collection c = SequenceIO.getSequences(
        new FileInputStream(inputFile), // InputStream
        new FastaFormatter(),           // SequenceInputFormatter
        builder);                       // SequenceBuilder
The DNASeqIO class also provides convenience methods for creating the default implementation of sequences:
    DNASeqIO.getDNASequences(
        new FileInputStream(inputFile), // InputStream
        new FastaFormatter());          // SequenceInputFormatter
Writing sequences to file
Writing sequences is easier, since the implementation of the sequence or DB does not matter.
Analogous to input, sequence output uses a SequenceDBWriter and a SequenceOutputFormatter. The output formatter generates records from a sequence, while the writer outputs these records along with any special file formatting, such as file headers with general properties.
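As a rough illustration of how the two output roles divide the work, the sketch below formats Fasta-style records and writes them after an optional header. The class FastaWriteSketch, its method names, the use of a name-to-data Map in place of a SequenceDB, and the line width of 60 are all assumptions made for this example.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Map;

/** Illustrative writer/formatter pair for Fasta-style output; not MolTools code. */
final class FastaWriteSketch {
    /** Assumed line width for sequence data; chosen arbitrarily for the sketch. */
    static final int LINE_WIDTH = 60;

    private FastaWriteSketch() {}

    /** Formatter role: produce one record for one named sequence. */
    static String formatRecord(String name, String data) {
        StringBuilder record = new StringBuilder(">").append(name).append('\n');
        for (int i = 0; i < data.length(); i += LINE_WIDTH) {
            // Emit the data in chunks of at most LINE_WIDTH symbols per line.
            record.append(data, i, Math.min(i + LINE_WIDTH, data.length())).append('\n');
        }
        return record.toString();
    }

    /** Writer role: emit optional header text, then every record in iteration order. */
    static void writeAll(Writer out, Map<String, String> sequences, String header) {
        try {
            if (header != null) {
                out.write(header + "\n");
            }
            for (Map.Entry<String, String> entry : sequences.entrySet()) {
                out.write(formatRecord(entry.getKey(), entry.getValue()));
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience wrapper that collects the output in memory. */
    static String writeToString(Map<String, String> sequences, String header) {
        StringWriter out = new StringWriter();
        writeAll(out, sequences, header);
        return out.toString();
    }
}
```

The formatter never touches the stream and the writer never inspects a record, so either can be replaced without changing the other.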
Convenience methods are available for output as well, as shown:
    import java.io.FileOutputStream;
    // (MolTools imports omitted)

    // Using the default implementation of SequenceDBWriter, and no converter
    SequenceIO.putSequenceDB(
        new FileOutputStream(outputFile), // OutputStream
        db,                               // The SequenceDB to write
        null,                             // SequenceDBWriter
        null,                             // SequenceConverter
        new FastaFormatter(),             // SequenceOutputFormatter
        null);                            // Optional Map of properties for the writer

    // This is equivalent to:
    SequenceIO.putSequenceDB(
        new FileOutputStream(outputFile), // OutputStream
        db,                               // The SequenceDB to write
        SequenceIO.defaultDBIO,           // SequenceDBWriter
        null,                             // SequenceConverter
        new FastaFormatter(),             // SequenceOutputFormatter
        null);                            // Optional Map of properties for the writer

    // Omitting the writer (same as providing a null writer)
    SequenceIO.putSequenceDB(
        new FileOutputStream(outputFile), // OutputStream
        db,                               // The SequenceDB to write
        null,                             // SequenceConverter
        new FastaFormatter(),             // SequenceOutputFormatter
        null);                            // Optional Map of properties for the writer

    // Using a SequenceDBOutputFormat
    SequenceIO.putSequenceDB(
        new FileOutputStream(outputFile), // OutputStream
        db,                               // The SequenceDB to write
        null,                             // SequenceConverter
        new FastaDBFormat(),              // SequenceDBOutputFormat
        null);                            // Optional Map of properties for the writer