Development guide
Introduction
This document is intended to serve as a guide for developers who wish to use and/or contribute to the MolTools Java library.
It first describes the central Sequence and SequenceDB data structures,
and then the sequence IO system.
The IO system is designed to allow reading sequence data from a wide selection of file formats. The design is modular, so that individual components can be replaced to adjust how sequence data is processed.
In the terms of the MolTools library, the sequence IO task is to create a set of Sequence objects from
a stream of data, and conversely, to provide a stream of data from a set of Sequence objects.
The set of sequences read or written can be represented as a Java Collection, or as a MolTools
SequenceDB object.
The Sequence structure
The basic sequence data structure used in the MolTools library is defined
by the Sequence interface, which defines the following methods:
- String getName()
- Returns the name of the sequence
- int length()
- Returns the length (number of nucleotides for a DNA sequence) of the sequence
- String seqString()
- Returns the default String representation of the sequence, using the standard symbols for individual subunits (nucleotides)
- String subsequence(int start, int end)
- Returns the sub-sequence from start to end (both inclusive). The first position of the sequence is position '1'
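The 1-based, inclusive indexing of subsequence differs from Java's 0-based, end-exclusive String.substring, so the mapping is worth spelling out. The following is a minimal sketch and not MolTools code: the Sequence interface is re-declared locally from the method list above, and StringSequence is a hypothetical implementation introduced only for illustration.

```java
// Illustrative sketch only; nothing here is taken from the MolTools sources.
final class SequenceSketch {

    /** Local stand-in mirroring the four methods listed in this guide. */
    interface Sequence {
        String getName();
        int length();
        String seqString();
        String subsequence(int start, int end);
    }

    /** Hypothetical String-backed implementation (not a MolTools class). */
    static final class StringSequence implements Sequence {
        private final String name;
        private final String data;

        StringSequence(String name, String data) {
            this.name = name;
            this.data = data;
        }

        @Override public String getName() { return name; }
        @Override public int length() { return data.length(); }
        @Override public String seqString() { return data; }

        @Override
        public String subsequence(int start, int end) {
            if (start < 1 || end > data.length() || start > end) {
                throw new IndexOutOfBoundsException(
                        "Invalid range " + start + ".." + end + " for length " + data.length());
            }
            // 1-based inclusive [start, end] maps to 0-based half-open [start - 1, end).
            return data.substring(start - 1, end);
        }
    }
}
```

With this sketch, subsequence(1, length()) returns the whole sequence, and subsequence(n, n) returns the single symbol at position n.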
Several implementations of this interface exist. To let the client select which implementation
is created when sequence data is read, the SequenceBuilder interface is used.
A SequenceBuilder object is responsible for creating a particular type of sequence from a
Map of sequence properties. The common properties are sequence name and sequence data,
although other properties may be defined. The builder in use determines how the property values are used
to create the Sequence object. The builder may also be supplied with a SequenceConverter
object, which converts the actual sequence data from one notation to another (for example by removing brackets
or parentheses, or by changing between lower and upper case).
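The builder and converter roles can be sketched as follows. This is an assumption-laden illustration, not the library's API: the interface shapes, the method names build and convert, the property keys "name" and "data", and the classes NamedSequence and SimpleBuilder are all hypothetical, since this guide does not list the actual signatures.

```java
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; all names and signatures below are assumptions.
final class BuilderSketch {

    /** Hypothetical converter: rewrites raw sequence data into another notation. */
    interface SequenceConverter {
        String convert(String rawData);
    }

    /** Hypothetical result type standing in for a Sequence implementation. */
    static final class NamedSequence {
        final String name;
        final String data;

        NamedSequence(String name, String data) {
            this.name = name;
            this.data = data;
        }
    }

    /** Hypothetical builder: decides how property values become a sequence object. */
    static final class SimpleBuilder {
        NamedSequence build(Map<String, String> properties, SequenceConverter converter) {
            String name = properties.get("name"); // assumed property key
            String data = properties.get("data"); // assumed property key
            if (name == null || data == null) {
                throw new IllegalArgumentException("Both 'name' and 'data' are required");
            }
            // A null converter means the data is used as-is.
            String converted = (converter == null) ? data : converter.convert(data);
            return new NamedSequence(name, converted);
        }
    }

    /** Example converter: removes brackets and parentheses, then upper-cases. */
    static final SequenceConverter STRIP_AND_UPPER =
            raw -> raw.replaceAll("[\\[\\]()]", "").toUpperCase(Locale.ROOT);
}
```

Keeping the converter separate from the builder means the same notation clean-up can be reused with any builder.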
The SequenceDB structure
Sets of sequences are generally represented by a SequenceDB object. These come in different shapes and sizes.
The simplest type is required to define the following methods:
- boolean containsSequence(Sequence seq)
- Check whether this DB contains 'seq'
- boolean isEmpty()
- Check whether this DB is empty
- Iterator iterator()
- Return an Iterator over the sequences in the DB
- int size()
- Return the size of the DB
- SequenceDB select(SequencePredicate p)
- Return a DB containing all sequences that match the predicate p
Some of these methods are redundant, since the same information can be obtained through the others (isEmpty(), for instance, follows from size()). Declaring them anyway lets each implementation decide how to carry out the operation, which leaves room for performance optimization.
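A read-only DB with the five methods above might look like the sketch below. It is not a MolTools class: ListDB, Seq, and the predicate method name test are hypothetical, and sequences are reduced to a name and a data string to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch only; all types here are local stand-ins.
final class SequenceDBSketch {

    /** Minimal stand-in for a Sequence object. */
    static final class Seq {
        final String name;
        final String data;

        Seq(String name, String data) {
            this.name = name;
            this.data = data;
        }
    }

    /** Hypothetical predicate interface; the method name is an assumption. */
    interface SequencePredicate {
        boolean test(Seq seq);
    }

    /** Hypothetical list-backed, read-only DB. */
    static final class ListDB implements Iterable<Seq> {
        private final List<Seq> seqs;

        ListDB(List<Seq> seqs) {
            this.seqs = new ArrayList<>(seqs); // defensive copy
        }

        boolean containsSequence(Seq seq) { return seqs.contains(seq); }

        // Redundant with size() == 0, but an implementation backed by a file
        // or an index could answer this without counting every sequence.
        boolean isEmpty() { return seqs.isEmpty(); }

        @Override
        public Iterator<Seq> iterator() {
            return Collections.unmodifiableList(seqs).iterator();
        }

        int size() { return seqs.size(); }

        /** Returns a new DB holding only the sequences that match p. */
        ListDB select(SequencePredicate p) {
            List<Seq> hits = new ArrayList<>();
            for (Seq s : seqs) {
                if (p.test(s)) {
                    hits.add(s);
                }
            }
            return new ListDB(hits);
        }
    }
}
```

Because select returns another DB, selections can be chained, for example db.select(p1).select(p2).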
Other DB types allow the addition and removal of sequences, and access to sequences through an integer index,
or through the sequence name. When importing sequences, the SequenceDB implementation that is used is
determined by a SequenceDBBuilder. In addition to constructing and populating a DB, this builder may
perform other actions on the sequences, such as sorting or filtering. The DB builder is provided with a List of
sequence data Maps, a SequenceBuilder, and (optionally) a SequenceConverter.
This gives the DB builder the potential to control the process tightly, while the default builder simply passes each
sequence data Map to the sequence builder together with the converter and stores the resulting Sequence
object in the DB.
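The difference between the default behaviour and a builder that intervenes can be sketched as below. Everything here is a simplified assumption rather than the SequenceDBBuilder API: the sequence builder is modelled as a BiFunction, the converter as a UnaryOperator, the DB as a List, and the keys "name" and "data" are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.UnaryOperator;

// Illustrative sketch only; the functional types stand in for the library interfaces.
final class DBBuilderSketch {

    /** Helper that creates one sequence data map (keys are assumed). */
    static Map<String, String> record(String name, String data) {
        Map<String, String> m = new HashMap<>();
        m.put("name", name);
        m.put("data", data);
        return m;
    }

    /** Default behaviour: hand every map to the sequence builder, in input order. */
    static <S> List<S> buildDefault(
            List<Map<String, String>> records,
            BiFunction<Map<String, String>, UnaryOperator<String>, S> sequenceBuilder,
            UnaryOperator<String> converter) {
        List<S> db = new ArrayList<>();
        for (Map<String, String> r : records) {
            db.add(sequenceBuilder.apply(r, converter));
        }
        return db;
    }

    /** A builder that intervenes: records shorter than minLength are never built. */
    static <S> List<S> buildFiltered(
            List<Map<String, String>> records,
            BiFunction<Map<String, String>, UnaryOperator<String>, S> sequenceBuilder,
            UnaryOperator<String> converter,
            int minLength) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> r : records) {
            String data = r.get("data");
            if (data != null && data.length() >= minLength) {
                kept.add(r);
            }
        }
        return buildDefault(kept, sequenceBuilder, converter);
    }
}
```

Filtering on the raw maps, before any Sequence object exists, avoids the cost of building sequences that would be discarded anyway.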
The sequence IO system
The IO system is designed to allow reading sequence data from and writing data to a wide selection of file formats. The design is modular, so that individual components can be replaced to adjust how sequence data is processed.
In the terms of the MolTools library, the sequence IO task is to create a set of Sequence objects from
a stream of data, and conversely, to provide a stream of data from a set of Sequence objects.
The set of sequences read or written can be represented as a Java Collection, or as a MolTools
SequenceDB object.
Reading sequences from file
As sequences are created by providing a builder with a Map of property values, the task of
sequence input is to create a list of such maps. These may then be used directly to create the sequences with a
provided builder, or be sent along with the builder to a DB builder for special processing.
The input data stream is read by a SequenceDBReader which is supplied with a
SequenceInputFormatter. In general, the formatter reads individual sequence records, and creates a
map of data to be sent to a SequenceBuilder. A simple reader thus opens the stream and calls the formatter repeatedly
until all sequences have been read. A reader may also begin by reading a set of header data from the stream and then
read the sequence records, creating the data maps from the data found in both the header and the individual
records. This system should be flexible enough to read most structured text-based file formats. The reader may, however, choose
to perform the read in some other fashion without using a formatter, which allows for other file formats as well.
A SequenceDBInputFormat object defines a file format by providing both a SequenceDBReader and a
SequenceInputFormatter.
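The division of labour between reader and formatter can be illustrated with a simplified Fasta parser. This is a sketch under stated assumptions, not the library's FastaFormatter or SequenceDBReader: readRecord plays the formatter role, readAll plays the reader role, and the map keys "name" and "data" are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; simplified Fasta handling with assumed map keys.
final class FastaReadSketch {

    /** Formatter role: reads one record, or returns null at end of input. */
    static Map<String, String> readRecord(BufferedReader in) throws IOException {
        String line = in.readLine();
        while (line != null && !line.startsWith(">")) {
            line = in.readLine(); // skip anything before the next header line
        }
        if (line == null) {
            return null;
        }
        Map<String, String> record = new LinkedHashMap<>();
        record.put("name", line.substring(1).trim());
        StringBuilder data = new StringBuilder();
        while (true) {
            // Peek at the first character of the next line so that the
            // following record's header stays in the stream.
            in.mark(1);
            int c = in.read();
            if (c == -1) {
                break;
            }
            in.reset();
            if (c == '>') {
                break;
            }
            data.append(in.readLine().trim());
        }
        record.put("data", data.toString());
        return record;
    }

    /** Reader role: opens the stream and calls the formatter until it is exhausted. */
    static List<Map<String, String>> readAll(Reader source) {
        List<Map<String, String>> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            Map<String, String> record;
            while ((record = readRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

A format with a file header would add a step to readAll that parses the header before the loop and merges its values into each record map.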
The SequenceIO class provides a few utility methods to read sequences from file, using default implementations
of the input format classes. Reading a Fasta-formatted file without any extra modification may thus be done in any of the following ways:
//The input file
File inputFile = new File("example.fas");
//The sequence builder that creates the sequence type of interest
SequenceBuilder builder = new SimpleDNASequenceBuilder();
//Get a sequence DB with the default implementations of
//SequenceDBBuilder and SequenceDBReader, and no SequenceConverter
SequenceDB db = SequenceIO.getSequenceDB(
        new FileInputStream(inputFile),     //InputStream
        null,                               //SequenceDBBuilder
        null,                               //SequenceDBReader
        new FastaFormatter(),               //SequenceInputFormatter
        null,                               //SequenceConverter
        builder);                           //SequenceBuilder
//This is equivalent to:
SequenceDB db1 = SequenceIO.getSequenceDB(
        new FileInputStream(inputFile),     //InputStream
        ListSequenceDB.getDefaultBuilder(), //The default SequenceDBBuilder
        SequenceIO.defaultDBIO,             //The default SequenceDBReader
        new FastaFormatter(),               //SequenceInputFormatter
        null,                               //SequenceConverter
        builder);                           //SequenceBuilder
//Using a SequenceDBInputFormat for reader and formatter:
SequenceDB db2 = SequenceIO.getSequenceDB(
        new FileInputStream(inputFile),     //InputStream
        null,                               //SequenceDBBuilder
        new FastaDBFormat(),                //SequenceDBInputFormat
        null,                               //SequenceConverter
        builder);                           //SequenceBuilder
//or to just get a Collection
Collection c = SequenceIO.getSequences(
        new FileInputStream(inputFile),     //InputStream
        new FastaFormatter(),               //SequenceInputFormatter
        builder);                           //SequenceBuilder
The DNASeqIO class also provides convenience methods for creating the
default implementation of sequences:
DNASeqIO.getDNASequences(
        new FileInputStream(inputFile),     //InputStream
        new FastaFormatter());              //SequenceInputFormatter
Writing sequences to file
Writing sequences is simpler, since the implementation of the sequence or DB being written does not matter.
Analogously to input, sequence output uses a SequenceDBWriter and a SequenceOutputFormatter
to do the job. The output formatter generates a record from each sequence, while the writer outputs these records
along with any special file formatting, such as file headers with general properties.
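The same split applies on the output side, as the following simplified Fasta sketch suggests. It is not the library's SequenceOutputFormatter or SequenceDBWriter: formatRecord plays the formatter role, writeAll plays the writer role, sequences are passed as a hypothetical name-to-data Map, and the lineWidth parameter is an assumption added for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Map;

// Illustrative sketch only; simplified Fasta output with assumed parameters.
final class FastaWriteSketch {

    /** Formatter role: turns one sequence into a record, wrapping data at lineWidth. */
    static String formatRecord(String name, String data, int lineWidth) {
        if (lineWidth < 1) {
            throw new IllegalArgumentException("lineWidth must be positive");
        }
        StringBuilder sb = new StringBuilder(">").append(name).append('\n');
        for (int i = 0; i < data.length(); i += lineWidth) {
            sb.append(data, i, Math.min(i + lineWidth, data.length())).append('\n');
        }
        return sb.toString();
    }

    /** Writer role: emits every record; a format with a file header would write it first. */
    static void writeAll(Map<String, String> sequencesByName, Writer out, int lineWidth) {
        try {
            for (Map.Entry<String, String> e : sequencesByName.entrySet()) {
                out.write(formatRecord(e.getKey(), e.getValue(), lineWidth));
            }
            out.flush();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

Because the formatter only sees one sequence at a time, it can be paired with a different writer without change.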
Convenience methods are available for output as well, as shown:
//The output file (the file name is only an example)
File outputFile = new File("output.fas");
//Using the default implementation of SequenceDBWriter, and no converter
SequenceIO.putSequenceDB(
        new FileOutputStream(outputFile),   //OutputStream
        db,                                 //The SequenceDB to write
        null,                               //SequenceDBWriter
        null,                               //SequenceConverter
        new FastaFormatter(),               //SequenceOutputFormatter
        null);                              //Optional Map of properties for the writer
//This is equivalent to
SequenceIO.putSequenceDB(
        new FileOutputStream(outputFile),   //OutputStream
        db,                                 //The SequenceDB to write
        SequenceIO.defaultDBIO,             //SequenceDBWriter
        null,                               //SequenceConverter
        new FastaFormatter(),               //SequenceOutputFormatter
        null);                              //Optional Map of properties for the writer
//Omitting the writer (same as providing a null writer)
SequenceIO.putSequenceDB(
        new FileOutputStream(outputFile),   //OutputStream
        db,                                 //The SequenceDB to write
        null,                               //SequenceConverter
        new FastaFormatter(),               //SequenceOutputFormatter
        null);                              //Optional Map of properties for the writer
//Using a SequenceDBOutputFormat
SequenceIO.putSequenceDB(
        new FileOutputStream(outputFile),   //OutputStream
        db,                                 //The SequenceDB to write
        null,                               //SequenceConverter
        new FastaDBFormat(),                //SequenceDBOutputFormat
        null);                              //Optional Map of properties for the writer

