promanal

Laboratory for Plant Molecular Physiology, Faculty of Applied Biological Sciences, Gifu University

promanal

a tool for LDSS (local distribution of short sequences) analysis

This is a command-line program. In a typical run, it reads 10,000 promoter sequences of 1 kbp each, counts the occurrences of every short sequence at each promoter position, and outputs the total count of each sequence. Because all possible short sequences are analyzed, a 6-mer analysis gives distribution profiles of 4^6 (= 4,096) sequences and an 8-mer analysis gives profiles of 4^8 (= 65,536) sequences.
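The counting described above can be illustrated with a minimal Python sketch. It is not the program's own code: the function name `count_kmers_by_position`, the three toy promoters, and the choice to skip windows containing characters other than A, C, G, T are all assumptions made for illustration.

```python
from collections import defaultdict

def count_kmers_by_position(sequences, k=6):
    """Return {kmer: [count at position 0, count at position 1, ...]}.

    Assumes all sequences have the same length.
    """
    n_positions = len(sequences[0]) - k + 1
    profiles = defaultdict(lambda: [0] * n_positions)
    for seq in sequences:
        seq = seq.upper()
        for pos in range(n_positions):
            kmer = seq[pos:pos + k]
            # Assumption: windows with N or other non-ACGT letters are skipped.
            if len(kmer) == k and set(kmer) <= set("ACGT"):
                profiles[kmer][pos] += 1
    return dict(profiles)

# Toy input: three 10-base "promoters" (made up for this example).
promoters = ["GATATAAAGC", "CCTATAAATG", "GGTATAAACC"]
profiles = count_kmers_by_position(promoters, k=6)
print(profiles["TATAAA"])       # distribution profile of one hexamer
print(sum(profiles["TATAAA"]))  # total count of that hexamer
```

Each list is the distribution profile of one short sequence, and its sum is the total count for that sequence.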

Reference: Yamamoto YY*, Ichida H, Matsui M, Obokata J, Sakurai T, Satou M, Seki M, Shinozaki K, Abe T (2007) BMC Genomics 8:67

how to use (arguments)

-I Input Sequence File {REQUIRED}

-D Rawdata directory (default: ./)

-R Rawdata output [ON/OFF] (default: ON)

-F Summary filename (default: ./summary.txt)

-S Motif size (default: 6)

-A Window size (default: 10)

-L Incrementer prefix size (default: 1)

-T Number of threads [1-16] (default: 1)
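As a rough illustration of how these arguments fit together, the Python sketch below only builds and prints a command line; it does not run anything. The executable name `promanal`, the "flag, space, value" syntax, and the file name `promoters.fa` are assumptions, not confirmed by this page. Check them against your installed copy.

```python
import shlex

def build_promanal_command(input_fasta, motif_size=6, window=10,
                           rawdata_dir="./", summary="./summary.txt", threads=1):
    """Assemble an argument list from the options listed above.

    ASSUMPTION: the executable is called "promanal" and each flag is
    followed by its value as a separate argument.
    """
    if not 1 <= threads <= 16:
        raise ValueError("-T must be between 1 and 16")
    return ["promanal",
            "-I", input_fasta,
            "-D", rawdata_dir,
            "-F", summary,
            "-S", str(motif_size),
            "-A", str(window),
            "-T", str(threads)]

cmd = build_promanal_command("promoters.fa", motif_size=8, threads=4)
print(shlex.join(cmd))
```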

promoter sequence file

A promoter sequence file is expected to be a multi-fasta file as follows.

>promoter 1

GAGAGAAA....

TCCCAAAAA...

>promoter 2

TCTCTCTC..

AGAGAGA...

All the sequences need to have the same length.
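Before running an analysis, it may help to confirm that every record has the same length. The following Python sketch is one way to do that; the helper names `read_fasta` and `group_by_length` are made up for this example, and the input is a shortened version of the format shown above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of multi-FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def group_by_length(records):
    """Return {sequence length: [headers with that length]}."""
    groups = {}
    for header, seq in records:
        groups.setdefault(len(seq), []).append(header)
    return groups

example = """>promoter 1
GAGAGAAA
TCCCAAAA
>promoter 2
TCTCTCTC
AGAGAGA
""".splitlines()

by_length = group_by_length(read_fasta(example))
if len(by_length) > 1:
    print("Sequences differ in length:", by_length)
```

For a real file, pass an open file object instead of `example`, for instance `read_fasta(open(path))`.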

-D output directory

Four subdirectories (A, C, G, T) are created in the specified output directory. Each subdirectory holds many files, one per short sequence, and each file is named after its sequence.

-S length of the short sequences to analyze

The length of the short sequences to analyze is typically 6 to 8 bases. For hexamer analysis, give “6” as the argument. Shorter lengths, such as dimers or trimers, are also acceptable. We have not tried longer sequences such as 9-mers or 10-mers.

One problem with long sequences may be the number of output files. Some file systems limit the maximum number of files in one directory. Mac HFS: max 65,535 files; Mac HFS+: no practical limit.

Another problem with long sequences is that each one is found less often in the promoter DB. Very rare sequences have to be excluded from LDSS analysis for statistical reasons. In general, therefore, longer sequences occur less often in the promoter DB, and a smaller fraction of them is acceptable for LDSS analysis.
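The effect of sequence length on both problems can be estimated with simple arithmetic. The Python sketch below assumes 10,000 promoters of 1 kbp and, as a deliberate simplification, equal frequencies of the four bases. Real promoters are not uniform, so the numbers are only orders of magnitude. The function name `motif_statistics` is made up for this example.

```python
def motif_statistics(k, n_promoters=10_000, promoter_length=1_000):
    """Return (number of possible k-mers,
               expected total occurrences per k-mer,
               expected occurrences per k-mer per position).

    ASSUMPTION: all four bases are equally frequent and independent.
    """
    n_motifs = 4 ** k
    n_windows = n_promoters * (promoter_length - k + 1)
    expected_total = n_windows / n_motifs
    expected_per_position = n_promoters / n_motifs
    return n_motifs, expected_total, expected_per_position

for k in (6, 8, 10):
    n_motifs, total, per_position = motif_statistics(k)
    print(f"k={k}: {n_motifs:,} possible sequences (at most that many files), "
          f"~{total:,.0f} expected occurrences each, "
          f"~{per_position:.3f} per position")
```

Each additional base multiplies the number of possible sequences by 4 and divides the expected count per sequence by about 4.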

output

There are two kinds of output data. One is a set of thousands of files holding the exact (raw) data; each file contains the distribution profile of one short sequence. The other is a single summary table, a tab-delimited file that can be opened with Excel. The raw data have about 1,000 data points if the promoter length is 1 kb. Because older versions of Excel can handle at most 256 columns, you might need to reduce the number of data points. The number of data points in the summary table can be adjusted with the -A option: the table shows the average over each window of the specified size. An example set of conditions is:

promoter length: 1 kbp

promoter number: 10,000

length of the short sequences to analyze: 6 (-S option)

window size for summary table: 10 (-A option)

This will give a summary table of about 4,000 rows x 100 columns. This file can be opened with Excel.
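The reduction from about 1,000 raw data points to 100 summary columns can be pictured as averaging over consecutive windows. The Python sketch below assumes non-overlapping windows, which is consistent with the 1,000-to-100 reduction in the example but is not confirmed here as the program's exact method. The function name `window_average` and the toy profile are made up.

```python
def window_average(profile, window=10):
    """Average a profile over consecutive, non-overlapping windows.

    A final partial window, if any, is averaged over its own length.
    """
    averaged = []
    for start in range(0, len(profile), window):
        chunk = profile[start:start + window]
        averaged.append(sum(chunk) / len(chunk))
    return averaged

raw_profile = list(range(1000))          # stand-in for one raw profile
summary_row = window_average(raw_profile, window=10)
print(len(summary_row))                  # number of summary columns
print(summary_row[:3])
```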

Older versions of Excel can handle at most 65,536 rows, which is exactly the number of octamer sequences. This means that you can manage the results of an octamer analysis with Excel. Unfortunately, the summary table contains one extra line for the label, so you need to remove the first line before opening the file with such a version. (Excel 2007 and later accept up to 1,048,576 rows and 16,384 columns.)
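Removing the label line can be done with any text tool. As one possibility, the Python sketch below copies a file without its first line. The function name `drop_first_line`, the file names, and the contents of the demonstration file are made up; only the "first line is the label" layout comes from the description above.

```python
import tempfile
from pathlib import Path

def drop_first_line(src_path, dst_path):
    """Copy src_path to dst_path without its first line (the label line)."""
    with open(src_path, "r", newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        next(src, None)              # skip the label line, if present
        for line in src:
            dst.write(line)

# Demonstration on a tiny stand-in file (labels and values are invented).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "summary.txt"
    dst = Path(tmp) / "summary_nolabel.txt"
    src.write_text("label\tw1\tw2\nAAAAAAAA\t1.0\t2.0\nAAAAAAAC\t0.5\t0.0\n")
    drop_first_line(src, dst)
    remaining = dst.read_text().splitlines()

print(len(remaining))
```

Writing to a new file keeps the original summary table intact.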

Sequences that are not detected in any promoter do not produce output files.

Using Excel, you can also open the raw (not averaged) data of a specific sequence. Look for the file named after your sequence in the subdirectories.
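With thousands of files, searching by name is easier with a script. The Python sketch below looks through all subdirectories of the raw data directory for files whose name, ignoring any extension, equals the sequence. The function name `find_raw_files` is made up, and the directory layout and `.txt` extension in the demonstration are assumptions; the actual file names may differ.

```python
import tempfile
from pathlib import Path

def find_raw_files(rawdata_dir, motif):
    """Return files under rawdata_dir whose name (without extension) is motif."""
    motif = motif.upper()
    return sorted(p for p in Path(rawdata_dir).rglob("*")
                  if p.is_file() and p.stem.upper() == motif)

# Demonstration with an invented layout: empty files in A/ and T/.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("T/TATAAA.txt", "T/TATAAG.txt", "A/ATAAAT.txt"):
        path = Path(tmp) / name
        path.parent.mkdir(exist_ok=True)
        path.write_text("")
    hits = [p.relative_to(tmp).as_posix()
            for p in find_raw_files(tmp, "tataaa")]

print(hits)
```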

check

One easy check of the analysis is to look at the distribution profile of the TATA box. A TATAAA-containing sequence is expected to show a sharp peak around -30 bp in most (all?) eukaryotic organisms. If you do not see any peak, you might need to re-examine how the promoter sequences were prepared.
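The check can also be done numerically once a profile has been loaded as a list of counts. The Python sketch below makes several assumptions that are not stated on this page: the last base of each promoter is position -1 (immediately upstream of the transcription start site), the first data point is the 5' end, and each data point refers to the first base of the short sequence. The function name `peak_position` and the synthetic profile are made up.

```python
def peak_position(profile, motif_length=6):
    """Return (position relative to the 3' end, count) of the highest point.

    ASSUMPTIONS: index 0 is the 5' end, the last promoter base is -1,
    and each index marks the first base of the short sequence.
    """
    index = max(range(len(profile)), key=profile.__getitem__)
    promoter_length = len(profile) + motif_length - 1
    return index - promoter_length, profile[index]

# Synthetic hexamer profile for 1 kbp promoters: flat background, one peak.
profile = [2] * 995
profile[968] = 40
position, count = peak_position(profile)
print(position, count)
```

If your promoter sequences extend past the transcription start site, shift the reported position accordingly.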

problem?

Position numbering in the output files is determined by the longest sequence among the promoters. If the output appears to contain extra data, check the promoter sequence file to confirm that all the sequences have the same length.

The analysis usually finishes in minutes to hours. If the program does not finish within a day, it might be stuck. In that case, force-quit the program and check the promoter sequence file before retrying. An earlier version of the program had a problem with sequence files containing N. This specific problem has been fixed, but the program may still not properly handle input files with an unexpected format.
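When checking a suspicious input file, one thing to look for is characters other than A, C, G, and T. The Python sketch below tallies such characters per sequence. The function name `unexpected_characters` and the two example records are made up, and the sketch does not imply that such characters are the cause of any particular failure.

```python
from collections import Counter

def unexpected_characters(sequence):
    """Count characters other than A, C, G, T (case-insensitive)."""
    return Counter(ch for ch in sequence.upper() if ch not in "ACGT")

# Invented records: header -> sequence.
records = {
    "promoter 1": "GAGAGAAATCCCAAAA",
    "promoter 2": "GAGANNNAAAT-",
}
report = {name: dict(unexpected_characters(seq))
          for name, seq in records.items()}
for name, found in report.items():
    if found:
        print(name, found)
```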
