Zhiliang's Workbench:
Information / progress track
Major works performed: May 20, 2011 - June 15, 2011
Project: Pig array annotations - comparisons across 14 platforms
GOALS: - BLAST-link array elements to match their annotations
- Analyze the annotations
DATA PREP:
1. A-MEXP-693.adf: In-house printed array using the Illumina RefSet human
oligonucleotide collection (link)
2. GPL1881.txt: Qiagen-NRSP-8 porcine oligo array
3. GPL3594_GPL3585_GPL6173.txt: GPL3585: DIAS_PIG_55K2_v1
GPL3594: DIAS_PIG_27K2_v2
GPL6173: DJF__Pig_55K__v1
(which could be called "combined Danish microarray set")
4. GPL3764.txt: Porcine oligo microarray version 3 (POM3)
5. GPL4930.txt: In-house made at "University of Illinois at Urbana-Champaign,
Urbana IL" (link)
6. GPL5622.txt: SLA-RI/NRSP8-13K (which could be called "SLA_PrV porcine
DNA/cDNA microarray") this is the GEO name
7. GPL7151.txt: SLA/Immune Response/NRSP8 Pig 70 mers Oligonucleotides
3.8K + 13.3K v1
8. GPL7435.txt: Swine Protein-Annotated Oligonucleotide Microarray
(Illumina Oligo synthesis)
9. GPL7576.txt: Porcine oligonucleotide microarray version 4 (POM4)
(Condensed version)
10. GPL8448.txt: USDA/APHIS/FADDL Pan-viral 15K v4.2(Agilent)
11. GPL9710.txt: Operon Pig 14.4K genome microarray v1.0.2 (aka 13K NRSP8 oligo array)
and effectively there are the 2 Affy platforms
(GPL9710 overlaps almost completely with GPL1881. They are
different spottings of the same oligo set from Qiagen-Operon)
The Affy Array:
2010: - SNOWBALL_array_seqs.fa -- Already annotated by FIOS
      - miRNAs_array_seqs_v4.fa
      - unique_coding_seqs_for_array_v4.fa
      - virus_genomes_array_seqs_v4.fa
      (The three sections of the 2010 Affy platform, which will be merged
      at some point into one "SNOWBALL" platform (Chris Tuggle; 2011).)
      [2012 update: Freeman et al., 2012]
2005: - affy_ssc_consensus: 23935 sequences (2005 data; added on Oct 06, 2011)
      - affy_ssc_target: 24123 sequences (2005 data; added on Oct 06, 2011)
      - newAffy_probes: 599981 sequences (2011 data; added on Oct 10, 2011)
APPROACH:
Since the IPA is the most recent, inclusive data set (combining all previously
known Affy data sets) and has been well annotated, the basic approach is to
BLAST-match all "other" Affy platforms to IPA to get an idea of how they link
to each other.
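A minimal sketch of how the per-platform BLAST runs against IPA could be scripted. The log does not record the BLAST program or the command line that was used, so this assumes NCBI BLAST+ "blastn", assumes that "Priming seq length: 15 bp" corresponds to the word size, and assumes a nucleotide database named "IPA" built from IPA.fa. The output directory "blast_out" and the helper name "blastn_command" are hypothetical. The default E-value is the 1e-10 cutoff recorded under PROGRESS/STATUS.

```python
from pathlib import Path


def blastn_command(query_fasta, db_name, out_dir, evalue="1e-10", word_size=15):
    """Build (but do not run) a BLAST+ blastn command for one platform file.

    Assumptions: BLAST+ is the program in use, and the 15 bp "priming"
    length in the log maps to -word_size. Neither is confirmed by the log.
    """
    query = Path(query_fasta)
    out_file = Path(out_dir) / f"{query.stem}_vs_{db_name}.tsv"
    return [
        "blastn",
        "-query", str(query),
        "-db", db_name,
        "-evalue", str(evalue),
        "-word_size", str(word_size),
        "-outfmt", "6",          # 12-column tabular output
        "-out", str(out_file),
    ]


# Three of the platform files listed under DATA PREP, as an example.
platform_files = ["GPL1881.fa", "GPL3764.fa", "GPL9710.fa"]
commands = [blastn_command(f, "IPA", "blast_out") for f in platform_files]

for cmd in commands:
    print(" ".join(cmd))
```

The commands are only built and printed here. Once the paths and options are confirmed, each one could be executed with subprocess.run(cmd, check=True).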
PREVIOUS ANNOTATIONS:
1. annot_Affy_20k.csv (original file name: "Affy_20k_annot.csv"):
(from Oliver)
2. annot_Affy_20k_hs.csv (original file name: "Affy_20k_annot_human.csv"):
(from Oliver)
3. annot_IPA.csv (original file name: "ipa_annot.csv"):
(from Oliver, replacing "annot_Affy_20k.csv" and "annot_Affy_20k_hs.csv")
4. annot_snowball_20110509 (original file name: "snowball_annotation_20110509")
"snowball" annotation from Dario; snowball is the name of the new pig Affy chip.
The original annotation, I think, comes from Affy; not sure. The dates on these
files can be confusing. This one is 08-03-11, which means March 8, 2011.
Dario's date (20110509) means May 9, 2011 (Chris Tuggle).
WORK LOCATIONS:
Project dir: DELL:/home/hu/projects/Tuggle_annot
Blast dir: DELL:/cluster/nagrp/run/ or
CLUSTER:/storage/nagrp/run/
IPA Annotations: MySQL::annotdb::IPAannot (569,378 rows)
Planned database: MySQL::host_arrayanno::Iblasted (not used)
Results dir: DELL:~apache/doc/pig/projects/array_annotatn
Results db: MySQL::annotdb::Iblasted
PROGRESS/STATUS:
1. Blast:
Initial considerations: look at the "coverage" first, keeping the thresholds
permissive so that the results are more "inclusive".
Threshold: E-value cutoff: 1e-3; Priming seq length: 15 bp
May 15, 2011: Initial blast
FASTA sequence file (per platform) | Number of sequences | Sequences with at least 1 hit to IPA |
A-MEXP-693.fa                      |               22800 |                                 7174 |
GPL1881.fa                         |               13677 |                                13132 |
GPL3594_GPL3585_GPL6173.fa         |               26035 |                                 4242 |
GPL3764.fa                         |                1818 |                                 1764 |
GPL4930.fa                         |               13297 |                                13132 |
GPL5622.fa                         |               17459 |                                16795 |
GPL7151.fa                         |               17459 |                                16795 |
GPL7435.fa                         |               20400 |                                19489 |
GPL7576.fa                         |                 357 |                                  346 |
GPL8448.fa                         |               14985 |                                  864 |
GPL9710.fa                         |               14057 |                                13324 |
SNOWBALL_array_seqs.fa             |             1091987 |                               752423 |
miRNAs_array_seqs_v4.fa            |                2370 |                                  336 |
unique_coding_seqs_for_array_v4.fa |               47769 |                                58481 |
virus_genomes_array_seqs_v4.fa     |                  35 |                                    4 |
IPA.fa                             |              639177 |                                      |
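The "coverage" mentioned in the initial considerations can be read off the table above as hits divided by sequences. A small sketch of that arithmetic, using the numbers exactly as tabulated (IPA.fa is left out because it has no hit count). The function name "coverage_report" is hypothetical. The sketch also flags any row where the hit count exceeds the sequence count, since such a row cannot be a simple fraction of the queries and it is worth checking how that number was counted.

```python
# (file name, number of sequences, sequences with at least 1 hit to IPA)
# Values copied from the May 15, 2011 table.
initial_blast = [
    ("A-MEXP-693.fa", 22800, 7174),
    ("GPL1881.fa", 13677, 13132),
    ("GPL3594_GPL3585_GPL6173.fa", 26035, 4242),
    ("GPL3764.fa", 1818, 1764),
    ("GPL4930.fa", 13297, 13132),
    ("GPL5622.fa", 17459, 16795),
    ("GPL7151.fa", 17459, 16795),
    ("GPL7435.fa", 20400, 19489),
    ("GPL7576.fa", 357, 346),
    ("GPL8448.fa", 14985, 864),
    ("GPL9710.fa", 14057, 13324),
    ("SNOWBALL_array_seqs.fa", 1091987, 752423),
    ("miRNAs_array_seqs_v4.fa", 2370, 336),
    ("unique_coding_seqs_for_array_v4.fa", 47769, 58481),
    ("virus_genomes_array_seqs_v4.fa", 35, 4),
]


def coverage_report(rows):
    """Return (name, hits / sequences, hits_exceed_sequences) per platform."""
    report = []
    for name, n_seqs, n_hits in rows:
        report.append((name, n_hits / n_seqs, n_hits > n_seqs))
    return report


for name, fraction, exceeds in coverage_report(initial_blast):
    flag = "  <-- hits exceed sequences; check how this was counted" if exceeds else ""
    print(f"{name:<36} {fraction:7.1%}{flag}")
```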
May 18, 2011: Decision from the conf call:
- Use an E-value cutoff of 1e-10 (pig to pig; human to RefSeq)
- Leave out "virus_genomes_array_seqs_v4.fa"
Jun 01, 2011: 2nd results: Blast at 1e-10
2. Blast results trim (by minimum overlap lengths/ % identity):
Jun 04, 2011: Filter results by Score > 40
Alignment length > 20
Identity > 80%
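The three filters above can be sketched as a pass over tabular BLAST output. This assumes the standard 12-column tabular layout (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score) and assumes that "Score" means the bit score; the log confirms neither. The probe and IPA identifiers in the sample are made up for illustration.

```python
import csv
import io

MIN_SCORE = 40.0       # "Score > 40" (assumed to be the bit score)
MIN_ALIGN_LEN = 20     # "Alignment length > 20"
MIN_IDENTITY = 80.0    # "Identity > 80%"


def filter_hits(lines):
    """Yield tabular BLAST rows that pass all three thresholds."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        identity = float(row[2])
        align_len = int(row[3])
        bit_score = float(row[11])
        if (bit_score > MIN_SCORE
                and align_len > MIN_ALIGN_LEN
                and identity > MIN_IDENTITY):
            yield row


# Hypothetical hits: only the first one passes every threshold.
sample = io.StringIO(
    "probe_A\tipa_1\t98.57\t70\t1\t0\t1\t70\t101\t170\t1e-30\t125\n"
    "probe_B\tipa_2\t78.00\t50\t11\t0\t1\t50\t5\t54\t1e-11\t42.1\n"
    "probe_C\tipa_3\t100.00\t18\t0\t0\t1\t18\t9\t26\t1e-10\t36.2\n"
)
kept = list(filter_hits(sample))
print([row[0] for row in kept])
```

All comparisons are strict (">"), matching the way the thresholds are written in the log.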
Jun 07, 2011: New Blast data summary
FASTA sequence file (per platform) | Number of sequences | Sequences with at least 1 hit to IPA |
A-MEXP-693                         |               22800 |                                 7174 |
GPL1881                            |               13677 |                                13324 |
GPL3594-3585-6173                  |               26035 |                                34676 |
GPL3764                            |                1818 |                                 1764 |
GPL4930                            |               13297 |                                13132 |
GPL5622                            |               17459 |                                16795 |
GPL7151                            |               17459 |                                16795 |
GPL7435                            |               20400 |                                19489 |
GPL7576                            |                 357 |                                  346 |
GPL8448                            |               14985 |                                  864 |
GPL9710                            |               14057 |                                13324 |
v4_coding.seq                      |               47769 |                                58217 |
v4_miRNAs                          |                2370 |                                  336 |
v4_SNOWBALL                        |             1091987 |                               752423 |
IPA.fa                             |              639177 |                                      |
3. Port the blast results to database:
Jun 08, 2011: done.
4. Build db queries to integrate the combined annotation matches
Jun 09, 2011 / Jun 15, 2011:
Query 1: The match matrix of the 14 platforms' elements to IPA, with annotations
o The output is limited to 200 for preview;
o Download the entire data set in this format, tab delimited, here
Query 2: Blast match of the positive blast hits to IPA, with blast scores
o The output is limited to 200 for preview;
o Download the entire data set in this format, tab delimited, here
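A rough sketch of what a "match matrix" query in the style of Query 1 could look like: one row per IPA element, its annotation, and one column per platform holding the matching array elements. The actual schema is not recorded in this log. The table names IPAannot and Iblasted come from WORK LOCATIONS, but every column name, every identifier in the sample rows, and the use of an in-memory SQLite database in place of MySQL are assumptions made for illustration only.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL annotdb (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE IPAannot (ipa_id TEXT PRIMARY KEY, gene_symbol TEXT);
CREATE TABLE Iblasted (platform TEXT, element_id TEXT, ipa_id TEXT, bit_score REAL);
""")

# Hypothetical rows, not real annotation or BLAST data.
conn.executemany(
    "INSERT INTO IPAannot VALUES (?, ?)",
    [("ipa_1", "GENE_A"), ("ipa_2", "GENE_B")],
)
conn.executemany(
    "INSERT INTO Iblasted VALUES (?, ?, ?, ?)",
    [
        ("GPL1881", "el_10", "ipa_1", 125.0),
        ("GPL9710", "el_77", "ipa_1", 120.0),
        ("GPL1881", "el_11", "ipa_2", 98.5),
    ],
)

# One output column per platform; only two platforms shown for brevity.
# LIMIT 200 mirrors the preview limit mentioned in the log.
MATRIX_SQL = """
SELECT a.ipa_id,
       a.gene_symbol,
       GROUP_CONCAT(CASE WHEN b.platform = 'GPL1881' THEN b.element_id END) AS GPL1881,
       GROUP_CONCAT(CASE WHEN b.platform = 'GPL9710' THEN b.element_id END) AS GPL9710
FROM IPAannot AS a
LEFT JOIN Iblasted AS b ON b.ipa_id = a.ipa_id
GROUP BY a.ipa_id, a.gene_symbol
ORDER BY a.ipa_id
LIMIT 200
"""

rows = conn.execute(MATRIX_SQL).fetchall()
for row in rows:
    # Tab-delimited output, with empty cells where a platform has no match.
    print("\t".join("" if value is None else str(value) for value in row))
```

A query in the style of Query 2 would presumably select the Iblasted rows directly, together with their scores, instead of pivoting them by platform.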
5. * Fine-tune the thresholds for optimal matches
* Follow-up work to improve the matches
Oct 3, 2011: Added Gene IDs (NCBI) and Symbols to the Query 1 results
Oct 6, 2011: Added Affy platform (see updated Query 1 link for results).
The overall platforms are now:
1. A-MEXP-693
2. Affy_consensus / Affy_target <-- NEW!
3. GPL1881
4. GPL3594-3585-6173
5. GPL3764
6. GPL4930
7. GPL5622
8. GPL7151
9. GPL7435
10. GPL7576
11. GPL8448
12. GPL9710
13. v4_coding.seq
14. v4_miRNAs
15. v4_SNOWBALL
Oct 10, 2011: Added "new Affy" platform (see updated Query 1 link for results).
The overall platforms are now:
1. A-MEXP-693 7174
2. affy_consensus 27897 <-- NEW!
affy_target 24717 <-- NEW!
3. GPL1881 13324
4. GPL3594-3585-6173 34676
5. GPL3764 1764
6. GPL4930 13132
7. GPL5622 16795
8. GPL7151 16795
9. GPL7435 19489
10. GPL7576 346
11. GPL8448 864
12. GPL9710 13324
13. newAffy 252 <-- NEW!
14. v4_coding.seq 58217
15. v4_miRNAs 336
16. v4_SNOWBALL 752423
x) ...
6. Wrap up:
May 22, 2012: The outcomes were further analysed by Bouabid Badaoui at Parco
Tecnologico Padano - CERSA, Italy. A manuscript was subsequently
developed. The results from this pipeline are used as "Supplementary
Data" to the publication (public site):
http://www.animalgenome.org/repository/pub/ITALY2013.0312
CONTACT: Chris Tuggle
LAST UPDATE: May 2012
Reference:
- Freeman TC, Ivens A, Baillie JK, Beraldi D, Barnett MW, Dorward D, Downing A, Fairbairn L, Kapetanovic R, Raza S, Tomoiu A, Alberio R, Wu C, Su AI, Summers KM, Tuggle CK, Archibald AL, Hume DA (2012).
A gene expression atlas of the domestic pig. BMC Biology 10:90.
May 20, 2011 - June 15, 2012 • Zhiliang Hu