Monday, March 16, 2015

Informatics and Information Technology for Human Genome

Greatest Achievements of the 20th century: Airplane (1903), Nuclear Power (1945), Computer (1948), Television (1948), The structure of DNA (1953), Integrated circuit (1958), The Internet (1969), Microprocessor (1971), The Web (1991), Human Genome - DNA sequences (2001).

„In 1962, scientists Francis Crick and James Watson were awarded the Nobel prize for their roles in discovering the structure of DNA, which is an acronym for deoxyribonucleic acid. Anything that is alive, from bacteria to elephants, has DNA. DNA stores genetic material and passes it on to the next generation. A copy of a living entity's DNA is passed to developing offspring. Once the DNA is passed to the developing offspring, it is used to make that offspring's body parts. DNA's structure allows a molecule to copy itself and allows genetic messages to arrive without getting garbled on the journey. DNA is a huge molecule called a macromolecule. However, DNA fits into small cells because it is packed in a process called supercoiling, in which DNA is wrapped around proteins called nucleosomes. Proteins called histones hold the coils together.” Image source: Andrew Hessel

The Human Genome Project (1990-2003)

STARTING: The revolution began in 1991 when, at the National Institutes of Health, Dr. Venter and his team developed expressed sequence tags (ESTs), a new technique to rapidly discover genes. The team then decided to use their new computing and computational tools, as well as new DNA sequencing technology, to sequence the genome of the first free-living organism, Haemophilus influenzae, in 1995.
FINISH: „The Human Genome Project (HGP) was one of the great feats of exploration in history - an inward voyage of discovery rather than an outward exploration of the planet or the cosmos; an international research effort to sequence and map all of the genes - together known as the genome - of members of our species, Homo sapiens. Completed in April 2003 (2001 – The first draft of the human genome is completed), the HGP gave us the ability, for the first time, to read nature's complete genetic blueprint for building a human being.”
2003, the 50th anniversary: April 2003 was designated "Human Genome Month" and April 25, 2003, "DNA Day" (Senate Concurrent Resolution Designating National DNA Day). April 25, 2003, marked the 50th anniversary of the description of the double-helix structure of DNA by James D. Watson and Francis H.C. Crick, considered by many to be one of the most significant scientific discoveries of the 20th century.
  • The Human Genome Project began in 1990 and was expected to be finished in 2003-2005
  • The International Human Genome Project (HGP) was developed in collaboration with the United States Department of Energy and begun in 1990 to map the human genome
  • The world's attention was perhaps most keenly focused on the sequencing and analysis of one genome, the human, which was published in 2001 by Dr. Venter and his team at Celera Genomics
  • The National Human Genome Research Institute began as the National Center for Human Genome Research (NCHGR)
Organizations and Research Projects
1. The National Human Genome Research Institute (formerly the National Center for Human Genome Research, NCHGR); The International Human Genome Project (HGP); DNA Structure
- Genome Informatics and Computational Biology Program (Model Organism Databases, International HapMap Project, Genome Sequencing Informatics Tools (GS-IT))

2. The J. Craig Venter Institute (formed in October 2006). For more than two decades Dr. J. Craig Venter and his research teams have been pioneers in genomic research. The revolution began in 1991 when, at the National Institutes of Health, Dr. Venter and his team developed expressed sequence tags (ESTs), a new technique to rapidly discover genes. Research groups: Genomic Medicine; Infectious Disease; Microbial; Environmental Genomics; Plant Genomics; Synthetic Biology; Bioenergy
- Informatics is at the heart of any genomics research enterprise as it is critical for the analysis of large data-sets
- Platforms: Informatics; Information Technology; Sequencing
- Information Technology: "The compute infrastructure boasts over 3 Terabytes of RAM distributed among its 119 nodes with 1,140 CPU cores. Storage capacity has grown to over half a Petabyte. We also maintain a gigabit Ethernet uplink to Internet2, a next generation high-speed optical network designed to facilitate collaboration and communication among research institutions."
- Software packages:

3. The New York Genome Center (NYGC) is at the forefront of transforming biomedical research and clinical care with the mission of saving lives. „Our member organizations are united in this unprecedented collaboration of technology, science, and medicine.”
- "Our standard server contains either 16 or 20 CPU cores, 256GB of RAM, and 5TB of local storage. Our compute cluster consists of over 100 of these servers offering over 2000 cores. Fact: The raw data files generated from sequencing a single human genome occupy approximately 130 gigabytes of disk. For comparison, a single high definition movie might take up 3 or 4 gigabytes. In the first five years of operation, we expect to generate nearly 5,000 terabytes, or 5 petabytes, of data files."
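To put the storage figures quoted above into perspective, here is a rough back-of-the-envelope sketch. It uses only the numbers from the quote (130 GB per genome, 3 to 4 GB per HD movie, 5,000 TB over five years) and assumes decimal units (1 TB = 1,000 GB). The function names are illustrative only, and the result is an upper bound because it ignores any processed or derived files.

```python
# Figures taken from the NYGC quote above.
GB_PER_GENOME = 130          # raw data for one sequenced human genome
GB_PER_HD_MOVIE = (3, 4)     # "3 or 4 gigabytes" per HD movie
FIVE_YEAR_TOTAL_TB = 5_000   # "nearly 5,000 terabytes, or 5 petabytes"

# Assumption: decimal units, 1 TB = 1,000 GB.
GB_PER_TB = 1_000


def genomes_that_fit(total_tb, gb_per_genome=GB_PER_GENOME):
    """Whole genomes' worth of raw data that fit in total_tb terabytes."""
    return (total_tb * GB_PER_TB) // gb_per_genome


def movies_per_genome(gb_per_genome=GB_PER_GENOME, movie_gb=GB_PER_HD_MOVIE):
    """Range of HD movies equal in size to one genome's raw data."""
    small, large = movie_gb
    return gb_per_genome / large, gb_per_genome / small


genomes = genomes_that_fit(FIVE_YEAR_TOTAL_TB)
fewest, most = movies_per_genome()

print(f"5 PB holds the raw data of about {genomes:,} genomes")
print(f"One genome is roughly {fewest:.1f} to {most:.1f} HD movies")
```

Under these assumptions, five petabytes corresponds to the raw data of roughly 38,000 genomes, and a single genome takes about as much disk as 30 to 40 HD movies.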

Human Genome Project 2.0?
„[...] new phase of the Human Genome Project be undertaken -- the writing of a synthetic human genome.” Andrew Hessel, Co-Chair, Bioinformatics and Biotechnology at the Singularity University
„Time to write human genomes: In 1990, scientists launched one of the largest international collaborations in life science, an effort to sequence the human genome. This work inspired a new generation of genetic researchers and led to countless breakthroughs in new technologies, informatics, and medicine. I was one of the researchers who had their careers launched by this effort. Now, in 2012, there’s a need for a new grand challenge, and I can think of none better than to have scientists organize to write a human genome. If you think this is hard, remember, it’s what every cell has to do each time it divides. My thoughts on the next human genome project were published in the Huffington Post.” Andrew Hessel – programming life

1. The Sequence of the Human Genome, J. Craig Venter, Mark D. Adams, Eugene W. Myers, ..., Chunlin Xiao, Chunhua Yan, et al., Science, vol. 291, pp. 1304-1351, 2001
2. On the quantity and quality of single nucleotide polymorphisms in the human genome, Richard Durrett and Vlada Limic (Department of Mathematics, Cornell University, Ithaca NY 14853, USA), Stochastic Processes and their Applications, Volume 93, Issue 1, Elsevier, May 2001, Pages 1–24
5. Istrail Lab

Romanian Genetics - Major studies of Romanians
"In comparison with other European ethnic groups, it is comparatively hard (as of 2012) to find detailed data on Romanian DNA. Nevertheless, I was able to find the following estimates for the frequencies of Y-DNA haplogroups among Romanian people: 7.4% E, 5.6% G, 22.2% I or I2, 5.6% J, 20.4% R1a, 13% R1b.
- Haplogroup I is also common in Bosnia and Herzegovina, in Serbia, in Croatia, in Sardinia, and in Scandinavian countries. Some of these peoples, especially Sardinians and Serbs, are believed to primarily descend from among the earliest European settlers, prior to successive waves of new immigrations from the Near East.
- Haplogroup I2a1b is found in especially high frequencies in northeastern Romania, Moldova, and central Ukraine. I tried to parse out ethnic Romanians in Family Tree DNA's "Romania" group. As far as I can tell, for non-Jewish non-Hungarian non-German non-Roma ethnic Romanians their Y-DNA haplogroups include E1b1b1, G, I1, I2a, Q, and R1b1a2. This is in line with the expectations from the frequencies above. Their mtDNA haplogroups include H, H7, I, J, K1c1, M, U4, and U5." [1].
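As a quick sanity check on the quoted Y-DNA frequencies, the sketch below adds them up and ranks them. The percentages are copied from the quote; the dictionary and function names are illustrative only. The listed haplogroups do not add up to 100%, so the remainder presumably covers haplogroups the quote does not list, which is an interpretation rather than something the source states.

```python
# Y-DNA haplogroup frequencies (percent) as quoted above for Romanians.
reported = {
    "E": 7.4,
    "G": 5.6,
    "I or I2": 22.2,
    "J": 5.6,
    "R1a": 20.4,
    "R1b": 13.0,
}


def summarize(freqs):
    """Return (total %, unlisted remainder %, haplogroups ranked by frequency)."""
    total = round(sum(freqs.values()), 1)
    remainder = round(100.0 - total, 1)
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    return total, remainder, ranked


total, remainder, ranked = summarize(reported)

print(f"Listed haplogroups cover {total}% of the sample")
print(f"Not covered by the quoted figures: {remainder}%")
for name, pct in ranked:
    print(f"  {name:8s} {pct:5.1f}%")
```

The listed groups account for about three quarters of the sample, with I (or I2) and R1a together making up more than 40%.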
  • "Phylogeography of Y-chromosome haplogroup I reveals distinct domains of prehistoric gene flow in Europe." American Journal of Human Genetics 75:1 (July 1, 2004): pages 128-137. First electronically published on May 25, 2004. 
This study found that 22% of the people of Romania and Moldova belong to the Y-DNA haplogroup I.
"The area between the Dniester and the eastern Carpathian mountain range is at a geographical crossroads between eastern Europe and the Balkans. Little is known about the genetics of the population of this region. We performed an analysis of 12 binary autosomal markers in samples from six Dniester-Carpathian populations: two Moldavian, one Romanian, one Ukrainian and two Gagauz populations. The results were compared with gene frequency data from culturally and linguistically related populations from Southeast Europe and Central Asia. Small genetic differences were found among southeastern European populations (in particular those of the Dniester-Carpathian region). The observed homogeneity suggests either a very recent common ancestry of all southeastern European populations or strong gene flow between them. Despite this low level of differentiation, tree reconstruction and principal component analyses allowed a distinction between Balkan-Carpathian (Macedonians, Romanians, Moldavians, Ukrainians and Gagauzes) and eastern Mediterranean (Turks, Greeks and Albanians) population groups. The genetic affinities among Dniester-Carpathian and southeastern European populations do not reflect their linguistic relationships. The results indicate that the ethnic and genetic differentiations occurred in these regions to a considerable extent independently of each other. In particular, Gagauzes, a Turkic-speaking population, show closer affinities to their geographical neighbors than to other Turkic populations." Alexandru Varzari, 2006 [2].

1. The American Center of Khazar Studies (A Resource for Turkic and Jewish History in Russia and Ukraine)
2. Alexandru Varzari's dissertation "Population History of the Dniester-Carpathians: Evidence from Alu Insertion and Y-Chromosome Polymorphisms" for the Faculty of Biology, Ludwig-Maximilians-Universität, July 27, 2006
