promanal

 



a tool for LDSS (Local Distribution of Short Sequences) analysis



This is a command-line program. Typically, it reads 10,000 promoter sequences of 1 kbp each, counts the occurrences of each short sequence at each promoter position, and outputs the total counts per sequence. Because all possible short sequences are analyzed, a 6-mer analysis gives distribution profiles of 4^6 (= 4,096) sequences and an 8-mer analysis gives profiles of 4^8 (= 65,536) sequences.
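
The counting step described above can be sketched in Python as follows. This is only an illustration, not the program's actual implementation: the function name `count_by_position` is hypothetical, and it assumes that only the given strand is scanned, that a position is the 0-based start of the short sequence, and that windows containing N or other non-ACGT characters are skipped.

```python
from collections import defaultdict

def count_by_position(promoters, k=6):
    """Return {kmer: [count at start position 0, 1, ...]} summed over all promoters."""
    if not promoters:
        return {}
    # ASSUMPTION: positions are numbered according to the longest sequence.
    n_positions = max(len(seq) for seq in promoters) - k + 1
    profiles = defaultdict(lambda: [0] * n_positions)
    for seq in promoters:
        seq = seq.upper()
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            # ASSUMPTION: skip windows containing N or other non-ACGT characters.
            if set(kmer) <= set("ACGT"):
                profiles[kmer][pos] += 1
    return dict(profiles)

# Toy input: three "promoters" of 6 bases each, analyzed as 3-mers.
promoters = ["ACGTAC", "ACGTTT", "TTGTAC"]
profiles = count_by_position(promoters, k=3)
print(profiles["ACG"])   # profile of ACG over start positions 0..3
print(4 ** 6, 4 ** 8)    # number of possible 6-mers and 8-mers
```

Each list in `profiles` plays the role of one distribution profile: one short sequence, one count per promoter position.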


Reference: Yamamoto YY*, Ichida H, Matsui M, Obokata J, Sakurai T, Satou M, Seki M, Shinozaki K, Abe T (2007) BMC Genomics 8:67


how to use (arguments)

-I      Input Sequence File {REQUIRED}

-D      Rawdata directory (default: ./)

-R      Rawdata output [ON/OFF] (default: ON)

-F      Summary filename (default: ./summary.txt)

-S      Motif size (default: 6)

-A      Window size (default: 10)

-L      Incrementer prefix size (default: 1)

-T      Number of threads [1-16] (default: 1)


promoter sequence file

A promoter sequence file is expected to be a multi-FASTA file, as follows.


    >promoter 1
    GAGAGAAA....
    TCCCAAAAA...
    >promoter 2
    TCTCTCTC..
    AGAGAGA...

All the sequences need to have the same length.
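
Before running the analysis, the equal-length requirement can be checked with a short Python script. This is a minimal sketch and not part of promanal: the helper names `read_fasta` and `length_outliers` are hypothetical, and the most common length is assumed to be the intended one.

```python
from collections import Counter

def read_fasta(lines):
    """Parse multi-FASTA text lines into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def length_outliers(records):
    """Return (name, length) for records that differ from the most common length."""
    if not records:
        return []
    # ASSUMPTION: the most common length is the intended promoter length.
    expected = Counter(len(seq) for _, seq in records).most_common(1)[0][0]
    return [(name, len(seq)) for name, seq in records if len(seq) != expected]

example = """>promoter 1
GAGAGAAA
TCCCAAAA
>promoter 2
TCTCTCTC
AGAGAGAG
>promoter 3
ACGTACGT
ACGT
""".splitlines()

records = read_fasta(example)
print(length_outliers(records))  # lists the records with an unexpected length
```

For a real input file, pass an open file handle to `read_fasta` instead of the example lines. An empty result means that all sequences have the same length.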


-D output directory

In the specified output directory, four subdirectories (A, C, G, T) are created. Within each subdirectory, many files are created, one for each short sequence. Each file is named after its sequence.


-S length of the short sequences to analyze

The length of the short sequences to analyze is typically 6 to 8 bases. For hexamer analysis, type “6” for the argument. Shorter lengths such as dimers or trimers are also acceptable. We have not tried longer sequences such as 9-mers or 10-mers.


One problem with long sequences might be the number of output files. Some file systems limit the maximum number of files in one directory. Mac HFS: max 65,535 files; Mac HFS+: no limitation.
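
To get a feeling for these numbers, the following Python sketch computes the maximum number of output files for a given motif size. The function `max_output_files` is hypothetical, and the per-subdirectory figure assumes that files are grouped into A, C, G and T by the first base of the sequence, which is not stated explicitly here. The values are upper limits, because sequences that never occur produce no file.

```python
def max_output_files(k):
    """Return (total files, files per subdirectory) for motif size k."""
    total = 4 ** k              # one file per possible k-mer
    per_subdir = 4 ** (k - 1)   # ASSUMPTION: files are grouped by first base
    return total, per_subdir

for k in (6, 8, 9, 10):
    total, per_subdir = max_output_files(k)
    print(f"{k}-mer: up to {total:,} files, {per_subdir:,} per subdirectory")
```

Under that assumption, a 9-mer analysis could reach 65,536 files in one subdirectory, just above the 65,535-file limit mentioned above.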


Another problem with long sequences is that they are found less often in the promoter DB. Very rare sequences have to be excluded from LDSS analysis for statistical reasons. Therefore, in general, using longer sequences reduces the number of occurrences in the promoter DB and thus the fraction of sequences that are acceptable for LDSS analysis.


output

There are two kinds of output data. One is a set of thousands of files containing the exact (raw) data; each file holds the distribution profile of one short sequence. The other is a single summary table file (tab-delimited) that can be opened with Excel. The raw data have about 1,000 data points if the promoter length is 1 kb. Because older versions of Excel (up to Excel 2003) can handle at most 256 columns, you might need to reduce the number of data points. The number of data points in the summary table can be adjusted with the -A option: the table shows the average over each window of the specified size. An example set of conditions is:

     promoter length: 1 kbp

     promoter number: 10,000

     length of the short sequences to analyze: 6 (-S option)

     window size for summary table: 10 (-A option)

This gives a summary table of about 4,000 rows x 100 columns, which can be opened with Excel.
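
The reduction from about 1,000 raw data points to 100 columns can be illustrated in Python. This is a sketch only: `window_average` is a hypothetical helper, and it assumes consecutive, non-overlapping windows (consistent with 1,000 positions and a window size of 10 giving 100 columns) and a shorter final window when the profile length is not a multiple of the window size.

```python
def window_average(profile, window=10):
    """Average consecutive, non-overlapping windows of a per-position profile."""
    averaged = []
    for start in range(0, len(profile), window):
        chunk = profile[start:start + window]
        # ASSUMPTION: a trailing partial window is averaged over its own length.
        averaged.append(sum(chunk) / len(chunk))
    return averaged

raw_profile = list(range(1000))          # stand-in for the raw counts of one sequence
summary_row = window_average(raw_profile, window=10)
print(len(summary_row))                  # number of columns in the summary row
```

With about 4,000 hexamers, one such row per sequence gives the table size quoted above.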


Older versions of Excel (up to Excel 2003) can handle a maximum of 65,536 rows, which is exactly the number of octamer sequences. This means that you can manage the results of an octamer analysis with Excel. Unfortunately, the summary table contains one extra line for the labels, so you need to remove the first line before opening the file with Excel.


Sequences that are not detected in any promoter do not produce output files.


Using Excel, you can also open the raw (not averaged) data of a specific sequence. Look for your sequence by file name in a subdirectory.


check

One easy check of the analysis is to look at the distribution profile of the TATA box. A TATAAA-containing sequence is expected to show a sharp peak around -30 bp in most (all?) eukaryotic organisms. If you do not see any peak, you might need to re-examine how the promoter sequences were prepared.


problem?

The numbering of positions in the output file is determined by the longest sequence among the promoters. If the output appears to contain extra data, check the promoter sequence file to confirm that all the sequences have the same length.


Analysis usually takes minutes to hours. If the program does not finish within a day, it might be stuck. In that case, force-quit the program and check the promoter sequence file before retrying. An earlier version of the program had a problem with N-containing sequence files. This specific problem has been fixed, but the program may still not properly handle input files with an unexpected format.


