MolTools library development guide

Development guide

Introduction

This document is intended to serve as a guide for developers who wish to use and/or contribute to the MolTools Java library.

It first describes the central Sequence and SequenceDB data structures, and then the sequence IO system.

The IO system is designed to read sequence data from a wide selection of file formats. The design is modular, so that individual components can be swapped to adjust how sequence data is processed.

In the terms of the MolTools library, the sequence IO task is to create a set of Sequence objects from a stream of data, and conversely, to provide a stream of data from a set of Sequence objects. The set of sequences read or written can be represented as a Java Collection, or as a MolTools SequenceDB object.

The Sequence structure

The basic sequence data structure used in the MolTools library is defined by the Sequence interface, which defines the following methods:

String getName(): Returns the name of the sequence
int length(): Returns the length (number of nucleotides for a DNA sequence) of the sequence
String seqString(): Returns the default String representation of the sequence, using the standard symbols for individual subunits (nucleotides)
String subsequence(int start, int end): Returns the sub-sequence from start to end (both inclusive). The first position of the sequence is position '1'
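
To make the 1-based, inclusive indexing concrete, here is a minimal sketch. It assumes a local stand-in interface that mirrors the four methods listed above; the classes SequenceSketch and StringBackedSequence are hypothetical and are not part of MolTools.

```java
final class SequenceSketch {

    // Local stand-in mirroring the four methods listed above; not the MolTools source.
    interface Sequence {
        String getName();
        int length();
        String seqString();
        String subsequence(int start, int end);
    }

    // Hypothetical implementation that keeps the sequence data in a plain String.
    static final class StringBackedSequence implements Sequence {
        private final String name;
        private final String data;

        StringBackedSequence(String name, String data) {
            this.name = name;
            this.data = data;
        }

        @Override public String getName() { return name; }
        @Override public int length() { return data.length(); }
        @Override public String seqString() { return data; }

        @Override
        public String subsequence(int start, int end) {
            // Positions are 1-based and both ends are inclusive
            if (start < 1 || end > data.length() || start > end) {
                throw new IndexOutOfBoundsException("Invalid range " + start + ".." + end);
            }
            // String.substring is 0-based with an exclusive end, so only the start shifts
            return data.substring(start - 1, end);
        }
    }

    public static void main(String[] args) {
        Sequence seq = new StringBackedSequence("seq1", "ACGTAC");
        System.out.println(seq.getName() + " has length " + seq.length());
        System.out.println(seq.subsequence(2, 4)); // positions 2, 3 and 4: CGT
    }
}
```

Note that subsequence(1, length()) returns the whole sequence under this convention.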

There are several implementations of this interface. To let the client select which implementation is created when reading sequence data, the SequenceBuilder interface is used. A SequenceBuilder object is responsible for creating a particular type of sequence from a Map of sequence properties. The common properties are sequence name and sequence data, although other properties may be defined. The builder in use determines how the property values are used to create the Sequence object. The builder may also be supplied with a SequenceConverter object, which is responsible for converting the actual sequence data from one notation to another, for example by removing brackets or parentheses, or by changing between lower and upper case.
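
The builder and converter roles can be sketched as follows. This is an illustration only: the interfaces Builder and Converter, the class NamedSequence, and the property keys "name" and "data" are all assumptions made for the example, since this guide does not give the actual MolTools signatures or key names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class BuilderSketch {

    // Hypothetical property keys; this guide does not name the keys the library uses.
    static final String NAME = "name";
    static final String DATA = "data";

    // Stand-in for a converter: rewrites raw sequence data into another notation.
    interface Converter {
        String convert(String raw);
    }

    // Stand-in for a builder: creates one sequence object from a property Map.
    interface Builder<S> {
        S build(Map<String, String> properties, Converter converter);
    }

    // Hypothetical sequence type produced by the builder below.
    static final class NamedSequence {
        final String name;
        final String data;

        NamedSequence(String name, String data) {
            this.name = name;
            this.data = data;
        }
    }

    // Removes brackets and parentheses, then switches to upper case.
    static final Converter CLEAN_UPPER =
        raw -> raw.replaceAll("[\\[\\]()]", "").toUpperCase(Locale.ROOT);

    // The builder decides how the property values are used; the converter is optional.
    static final Builder<NamedSequence> SIMPLE_BUILDER = (properties, converter) -> {
        String data = properties.getOrDefault(DATA, "");
        if (converter != null) {
            data = converter.convert(data);
        }
        return new NamedSequence(properties.getOrDefault(NAME, "unnamed"), data);
    };

    public static void main(String[] args) {
        Map<String, String> properties = new HashMap<>();
        properties.put(NAME, "seq1");
        properties.put(DATA, "ac(g)t[a]");

        NamedSequence seq = SIMPLE_BUILDER.build(properties, CLEAN_UPPER);
        System.out.println(seq.name + " " + seq.data); // seq1 ACGTA
    }
}
```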

The SequenceDB structure

Sets of sequences are generally represented by a SequenceDB object. These come in different shapes and sizes. The simplest type is required to define the following methods:

boolean containsSequence(Sequence seq): Check whether this DB contains 'seq'
boolean isEmpty(): Check whether this DB is empty
Iterator iterator(): Return an Iterator over the sequences in the DB
int size(): Return the size of the DB
SequenceDB select(SequencePredicate p): Return a DB containing all sequences that match the predicate p

Of course, a few of these methods are redundant, as the same information can be obtained by using the other methods, but this allows the implementation to decide how to carry out some of the actions, which may allow for performance optimization.
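
The redundancy can be illustrated with a small sketch in which the derivable methods get default implementations and a list-backed implementation overrides one of them with a cheaper version. The names ReadOnlyDb and ListDb are hypothetical, java.util.function.Predicate stands in for SequencePredicate, and plain Strings stand in for Sequence objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

final class DbSketch {

    // Stand-in for the simplest DB type described above.
    interface ReadOnlyDb<S> extends Iterable<S> {
        int size();

        ReadOnlyDb<S> select(Predicate<S> predicate);

        // Derivable from iterator(): empty means there is no first element
        default boolean isEmpty() {
            return !iterator().hasNext();
        }

        // Derivable from iterator(): a linear scan over all sequences
        default boolean containsSequence(S seq) {
            for (S candidate : this) {
                if (candidate.equals(seq)) {
                    return true;
                }
            }
            return false;
        }
    }

    // Hypothetical list-backed implementation.
    static final class ListDb<S> implements ReadOnlyDb<S> {
        private final List<S> items;

        ListDb(List<S> items) {
            this.items = new ArrayList<>(items);
        }

        @Override public Iterator<S> iterator() {
            // Unmodifiable view, so callers cannot remove sequences through the iterator
            return Collections.unmodifiableList(items).iterator();
        }

        @Override public int size() { return items.size(); }

        // Overridden because the backing list answers this without creating an iterator
        @Override public boolean isEmpty() { return items.isEmpty(); }

        @Override public ReadOnlyDb<S> select(Predicate<S> predicate) {
            List<S> matches = new ArrayList<>();
            for (S item : items) {
                if (predicate.test(item)) {
                    matches.add(item);
                }
            }
            return new ListDb<>(matches);
        }
    }

    public static void main(String[] args) {
        ReadOnlyDb<String> db = new ListDb<>(Arrays.asList("ACGT", "AC", "GGGTTT"));
        ReadOnlyDb<String> longOnes = db.select(s -> s.length() >= 4);
        System.out.println(longOnes.size());                 // 2
        System.out.println(longOnes.containsSequence("AC")); // false
    }
}
```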

Other DB types allow the addition and removal of sequences, and access to sequences through an integer index, or through the sequence name. When importing sequences, the SequenceDB implementation that is used is determined by a SequenceDBBuilder. In addition to constructing and populating a DB, this builder may perform other actions on the sequences, such as sorting or filtering. The DBbuilder is provided with a List of sequence data Maps, a SequenceBuilder, and (optionally) a SequenceConverter. This gives the DBbuilder the potential to tightly control the process, while the default builder simply passes each sequence data Map to the sequence builder together with the converter and stores the resulting Sequence object in the DB.
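
The difference between the default pass-through behaviour and a builder that takes more control might look like the sketch below. Everything in it is assumed for illustration: the DbBuilder interface, the use of BiFunction and UnaryOperator as stand-ins for SequenceBuilder and SequenceConverter, the property keys "name" and "data", and the toy sequence builder that produces Strings rather than Sequence objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.UnaryOperator;

final class DbBuilderSketch {

    // Stand-in for a DB builder: receives the data Maps, a sequence builder
    // and an optional converter, and returns the populated collection.
    interface DbBuilder<S> {
        List<S> build(List<Map<String, String>> records,
                      BiFunction<Map<String, String>, UnaryOperator<String>, S> sequenceBuilder,
                      UnaryOperator<String> converter);
    }

    // Default behaviour as described above: build each sequence and store it.
    static <S> DbBuilder<S> passThrough() {
        return (records, sequenceBuilder, converter) -> {
            List<S> db = new ArrayList<>();
            for (Map<String, String> record : records) {
                db.add(sequenceBuilder.apply(record, converter));
            }
            return db;
        };
    }

    // A builder that filters: records without sequence data are never built.
    static <S> DbBuilder<S> skippingEmptyData() {
        return (records, sequenceBuilder, converter) -> {
            List<S> db = new ArrayList<>();
            for (Map<String, String> record : records) {
                String data = record.get("data");
                if (data != null && !data.isEmpty()) {
                    db.add(sequenceBuilder.apply(record, converter));
                }
            }
            return db;
        };
    }

    // Toy sequence builder: returns "name:data", applying the converter if one is given.
    static String toySequence(Map<String, String> props, UnaryOperator<String> converter) {
        String data = props.get("data");
        return props.get("name") + ":" + (converter == null ? data : converter.apply(data));
    }

    // Helper for the demo; "name" and "data" are hypothetical property keys.
    static Map<String, String> record(String name, String data) {
        Map<String, String> props = new HashMap<>();
        props.put("name", name);
        props.put("data", data);
        return props;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = new ArrayList<>();
        records.add(record("s1", "acgt"));
        records.add(record("s2", ""));

        DbBuilder<String> plain = passThrough();
        DbBuilder<String> strict = skippingEmptyData();
        System.out.println(plain.build(records, DbBuilderSketch::toySequence, null));
        System.out.println(strict.build(records, DbBuilderSketch::toySequence, String::toUpperCase));
    }
}
```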

The sequence IO system

The IO system is designed to read sequence data from, and write it to, a wide selection of file formats. The design is modular, so that individual components can be swapped to adjust how sequence data is processed.

Reading sequences from file

As sequences are created by providing a builder with a Map of property values, the task of the sequence input is to create a list of such maps. These may then be used directly to create the sequences with a provided builder, or be sent along with the builder to a DBbuilder for special processing.

The input data stream is read by a SequenceDBReader which is supplied with a SequenceInputFormatter. In general, the formatter reads individual sequence records, and creates a map of data to be sent to a SequenceBuilder. A simple reader thus opens the stream, and calls the formatter repeatedly until all sequences have been read. A reader may also begin by reading a set of header data from the stream, and then read the sequence records, creating the data maps from the data found in the header as well as in the individual records. This system should be flexible enough to read from most structured text-based file formats. The reader may, however, perform the read in some other fashion without using a formatter, which allows for other file formats as well. A SequenceDBInputFormat object defines a file format by providing both a SequenceDBReader and a SequenceInputFormatter.
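
The division of labour between a simple reader and a formatter can be sketched as below. This is not the MolTools implementation: RecordFormatter, FastaLikeFormatter and readAll are hypothetical names, the property keys "name" and "data" are assumed, and the parsing covers only a simplified Fasta-like layout (a header line starting with '>' followed by data lines).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ReaderSketch {

    // Stand-in for an input formatter: reads one record, or returns null at end of stream.
    interface RecordFormatter {
        Map<String, String> readRecord(BufferedReader in) throws IOException;
    }

    // Hypothetical formatter for a simplified Fasta-like layout.
    // It is stateful, so use a fresh instance for each stream.
    static final class FastaLikeFormatter implements RecordFormatter {
        // Header of the next record, consumed while finishing the previous one
        private String pendingHeader;

        @Override
        public Map<String, String> readRecord(BufferedReader in) throws IOException {
            String header = pendingHeader;
            pendingHeader = null;
            // Skip any lines that precede the first header
            while (header == null) {
                String line = in.readLine();
                if (line == null) {
                    return null; // no more records
                }
                if (line.startsWith(">")) {
                    header = line;
                }
            }
            // Collect data lines until the next header or the end of the stream
            StringBuilder data = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    pendingHeader = line;
                    break;
                }
                data.append(line.trim());
            }
            Map<String, String> record = new LinkedHashMap<>();
            record.put("name", header.substring(1).trim());
            record.put("data", data.toString());
            return record;
        }
    }

    // Stand-in for a simple reader: calls the formatter until the stream is exhausted.
    static List<Map<String, String>> readAll(Reader source, RecordFormatter formatter) {
        List<Map<String, String>> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            Map<String, String> record;
            while ((record = formatter.readRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String text = ">a\nAC\nGT\n>b\nTTT\n";
        for (Map<String, String> record : readAll(new StringReader(text), new FastaLikeFormatter())) {
            System.out.println(record);
        }
    }
}
```

The resulting list of maps is what would then be handed to a sequence builder or a DB builder.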

The SequenceIO class provides a few utility methods to read sequences from file, using default implementations of the input format classes. Reading a Fasta-formatted file without any extra modification may thus be done in any of the following ways:

//The input file
File inputFile = new File("example.fas");
	
//The sequence builder that creates the sequence type of interest
SequenceBuilder builder = new SimpleDNASequenceBuilder();

//Get a sequence DB with the default implementations of 
//SequenceDBBuilder and SequenceDBReader, and no SequenceConverter
SequenceDB db = SequenceIO.getSequenceDB(
  new FileInputStream(inputFile),       //InputStream
  null,                                 //SequenceDBBuilder
  null,                                 //SequenceDBReader
  new FastaFormatter(),                 //SequenceInputFormatter
  null,                                 //SequenceConverter
  builder);                             //SequenceBuilder
      
//This is equivalent to:
SequenceDB db1 = SequenceIO.getSequenceDB(
  new FileInputStream(inputFile),       //InputStream
  ListSequenceDB.getDefaultBuilder(),   //The default SequenceDBBuilder
  SequenceIO.defaultDBIO,               //The default SequenceDBReader
  new FastaFormatter(),                 //SequenceInputFormatter
  null,                                 //SequenceConverter
  builder);                             //SequenceBuilder
      
//Using a SequenceDBInputFormat for reader and formatter:
SequenceDB db2 = SequenceIO.getSequenceDB(
  new FileInputStream(inputFile),       //InputStream
  null,                                 //SequenceDBBuilder
  new FastaDBFormat(),                  //SequenceDBInputFormat
  null,                                 //SequenceConverter
  builder);                             //SequenceBuilder
      
//or to just get a Collection
Collection c = SequenceIO.getSequences(
  new FileInputStream(inputFile),       //InputStream
  new FastaFormatter(),                 //SequenceInputFormatter
  builder);                             //SequenceBuilder

The DNASeqIO class also provides convenience methods for creating the default implementation of sequences:

  DNASeqIO.getDNASequences(          
    new FileInputStream(inputFile),       //InputStream
    new FastaFormatter());                //SequenceInputFormatter

Writing sequences to file

Writing sequences is easier, since the implementation of the sequence or DB does not matter. Analogous to input, sequence output uses a SequenceDBWriter and a SequenceOutputFormatter. The output formatter generates a record from each sequence, while the writer outputs these records along with any file-level formatting, such as file headers with general properties.
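
As a rough sketch of that split, the example below lets a formatter turn each sequence into a text record while a writer function adds an optional file header and emits the records in order. The names RecordFormatter, fastaLike and writeAll are hypothetical, sequences are reduced to name and data Strings, and a StringWriter stands in for the output stream.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

final class WriterSketch {

    // Stand-in for an output formatter: turns one sequence into one text record.
    interface RecordFormatter {
        String format(String name, String data);
    }

    // Hypothetical Fasta-like formatter that wraps the data at a fixed line width.
    static RecordFormatter fastaLike(int width) {
        return (name, data) -> {
            StringBuilder record = new StringBuilder(">").append(name).append('\n');
            for (int i = 0; i < data.length(); i += width) {
                record.append(data, i, Math.min(i + width, data.length())).append('\n');
            }
            return record.toString();
        };
    }

    // Stand-in for a writer: emits an optional file header, then one record per sequence.
    static String writeAll(Map<String, String> sequences, RecordFormatter formatter, String fileHeader) {
        StringWriter out = new StringWriter();
        if (fileHeader != null) {
            out.write(fileHeader + "\n");
        }
        for (Map.Entry<String, String> entry : sequences.entrySet()) {
            out.write(formatter.format(entry.getKey(), entry.getValue()));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the sequences in insertion order
        Map<String, String> sequences = new LinkedHashMap<>();
        sequences.put("s1", "ACGTAC");
        sequences.put("s2", "TT");
        System.out.print(writeAll(sequences, fastaLike(4), null));
    }
}
```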

Convenience methods are available for output as well, as shown:

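//The output file (the file name is only an example)
File outputFile = new File("example_out.fas");

//Using the default implementation of SequenceDBWriter, and no converter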
SequenceIO.putSequenceDB(
  new FileOutputStream(outputFile),     //OutputStream
  db,                                   //The SequenceDB to write
  null,                                 //SequenceDBWriter
  null,                                 //SequenceConverter 
  new FastaFormatter(),                 //SequenceOutputFormatter
  null);                                //Optional Map of properties for the writer
      
//This is equivalent to
SequenceIO.putSequenceDB(
  new FileOutputStream(outputFile),     //OutputStream
  db,                                   //The SequenceDB to write
  SequenceIO.defaultDBIO,               //SequenceDBWriter
  null,                                 //SequenceConverter 
  new FastaFormatter(),                 //SequenceOutputFormatter
  null);                                //Optional Map of properties for the writer
      
//Omitting the writer (same as providing a null writer)
SequenceIO.putSequenceDB(
  new FileOutputStream(outputFile),     //OutputStream
  db,                                   //The SequenceDB to write
  null,                                 //SequenceConverter 
  new FastaFormatter(),                 //SequenceOutputFormatter
  null);                                //Optional Map of properties for the writer

//Using a SequenceDBOutputFormat
SequenceIO.putSequenceDB(
  new FileOutputStream(outputFile),     //OutputStream
  db,                                   //The SequenceDB to write
  null,                                 //SequenceConverter 
  new FastaDBFormat(),                  //SequenceDBOutputFormat
  null);                                //Optional Map of properties for the writer
