Origins of CML

I shall be writing this blog mainly in the first person, but you should realise that CML is the joint product of Henry Rzepa and myself over many years. Simply, CML would not have happened without Henry. Perhaps in 2009 I am the more active contributor, but it’s a joint creation. So, if ever I write things that appear to be just due to me, please mentally replace this by PMRz – what I call our symbiote.

What are the origins of CML? I think I go back to ca 1980 when I was writing code to extend Sam Motherwell’s great FORTRAN toolkit for the Cambridge database – BIBSER (bibliographic search), CONNSER – the first and greatest chemical substructure algorithm, and GEOM78 – a geometry calculation tool. In 1976-1978 I used to visit Cambridge (from Stirling) and work with Sam on extracting structures from the database and analysing them. There was a rough division of labour and ideas. I came with a number of ideas and Sam would modify CONNSER and GEOM to support these – literally within a day or so. Recall (if you were alive then) that this involved overlays, blank COMMON, EQUIVALENCE and other mindbending ways of dumping the core.

I took the problems back to Stirling and “integrated” Sam’s output with SPSS (a well-known stats package from the social sciences). Not surprisingly this can of spaghetti started to get out of control. I did great amounts of analysis on the floor of our living room with an acoustic modem (http://en.wikipedia.org/wiki/Acoustic_coupler) where the handset was plugged into rubber cups. It used to run at 110 baud (110 bits, yes, bits per second). Since a character took 11 bits (to check and repair loss) we got 10 characters a second.
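To give a feel for what 10 characters a second meant in practice, here is a minimal sketch of the arithmetic. The function names and the page size (66 lines of 80 characters) are assumptions chosen for illustration; only the 110 baud and 11 bits per character figures come from the account above.

```python
def chars_per_second(baud: int, bits_per_char: int) -> float:
    """Characters transferred per second at a given line speed."""
    return baud / bits_per_char


def transfer_seconds(n_chars: int, baud: int = 110, bits_per_char: int = 11) -> float:
    """Time in seconds to send n_chars characters over the line."""
    return n_chars / chars_per_second(baud, bits_per_char)


if __name__ == "__main__":
    rate = chars_per_second(110, 11)
    # Hypothetical full page of output: 66 lines of 80 characters each.
    page_chars = 66 * 80
    seconds = transfer_seconds(page_chars)
    print(f"{rate:.0f} characters per second")
    print(f"{page_chars} characters take {seconds / 60:.1f} minutes")
```

On these assumptions a single full page would take nearly nine minutes to arrive, which goes some way to explaining why there was so much time to think.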

The sums were originally done at Cambridge (on Phoenix) but I ported the software to UMRCC (the regional computing centre at Manchester) on a CDC 7600. The results were printed out on folding “line printer” paper on a boustrophedonic teletype (ASR33?). I would then extract data by hand and enter them into stats programs, but gradually moved to doing the stats remotely. Remote graphics was always difficult – we could get printer plots from Aberdeen but it took a week. So I generally evolved ASCII plots.

The point of this is that during these sessions I had a lot of time to think about how to do it better. It was obvious the software had to be modular and I gradually got to thinking about modular data. Remember we used FORTRAN IV, which did little to encourage modularity.

In 1981/2 I spent a sabbatical with Jenny Glusker in Philadelphia and there developed a VAX version of the software. I extended this to plot aggregations of data in 2 and 3 dimensions. Again the idea of modular components was clear. I returned to start up molecular modelling / computer graphics in Glaxo and found myself working with a completely different set of files – ChemX, Mol, etc. I couldn’t use these with my analysis code. It seemed completely wasteful not to have a common format, so I started an activity within the Molecular Graphics Society to systematize file types.

In effect this was an attempt to build a chemical ontology. The word hadn’t been used then within science and it wouldn’t have helped if it had. I didn’t get much take up and there was active resistance from some software companies who regarded their formats as a commercial weapon.

However, the crystallographers had a much more unified view of the world. I continue to congratulate the Int. Union of Crystallography for its efforts in this area. In the mid-1980s there was an active group – led by David Brown – to create a self-defining format for crystallography. It was, I think, called CSF (Crystallographic Structure File) but I’m open to correction. It was essentially tagged data – controlled vocabulary – applied to scalars and arrays.

This then developed into the CIF, which is now the standard method of exchanging crystallographic information. This was based on data supported by data dictionaries which themselves were constrained to a dictionary definition specification. I started to use the CIF approach to model my scientific world – this was long before XML but it was essentially isomorphic to XML.

During this period I had gradually advanced my language skills from FORTRAN to BBC-BASIC and C. (Part of this was through teaching the MSc at Birkbeck.) So when C++ came along (1991?) I translated my approach to C++ and started to develop a toolkit/library. That’s probably effectively when CML as a data modelling approach started.

I started with the most obvious components – geometry and numbers. These are still an integral part of CML (the “Euclid” library). This was then extended to molecules, atoms and bonds and by ca 1993 I had a set of objects. But I needed a way to display and manipulate them.

At that stage I met Henry Rzepa – I’m not sure how. Henry visited me at Greenford and we found we had a common interest in the Internet and its power for disseminating chemistry. It must have been about the time of Mosaic – 1993. Anyway Henry and I went to the first WWW meeting – he ran a session on chemistry and I one on biology. We had an early version of RasMol which ran on UNIX and Henry had prepared a demo. We had it running the day before on a CERN machine but when we came back the next day someone had wiped the shared libraries to save space. We got the thing running again 5 mins after Henry’s talk started.

The theme of WWW1 was, of course, the use of HTML (and HTTP) to create distributed information. Because it was in CERN and all HTTP sites at that stage were academic the emphasis was all on science. How could you carry maths in HTML? And so how could you do the same for chemistry? We didn’t know how.

But Henry went to WWW2 later that year and came back with the idea that the future of the web would be “SGML”. And that is when I started to cast my objects into SGML. So probably late 1994 is when Chemistry and Markup Languages came together.

Later posts will take it from there.

6 Responses to Origins of CML

Henry Rzepa says:

January 16, 2011 at 9:37 am

Checking through some ancient files on our web server, I discovered that we first went public with CML on 21 August, 1995, in the form of an ACS poster. I notice that it introduced the concept of what is now referred to as data round tripping. The idea was to formalise and normalise both the input and output of a computer program so that the latter could reliably serve as an input.

I was in Chicago at the time, and Peter was back in the UK, at a terminal, waiting for comments from the audience to come flooding in! In fact, when we got to the Hotel room that the ACS session occurred in, we discovered no trace of any Internet connection (anywhere) and could not communicate. Peter sat in an unrequited silence throughout the entire presentation!

Henry Rzepa says:

January 16, 2011 at 9:42 am

I might add a little information on how Peter and I met. Probably around 1992-1993, the student chemical society at Imperial College invited Peter to give a talk. He chose the topic of crystallography, but in characteristic fashion, delivered a scintillating talk covering, well probably almost all of chemistry! I chatted to him after his talk, and one of us must have mentioned the Internet. The topic might have been Gophers (anyone remember them?) and what their potential was. This, in both our memories, is now immortalised by our next meeting, in the Black Horse pub in Greenford.

Max Bachrach says:

January 19, 2011 at 10:24 pm

I found the link to this blog on sourceforge, which provides your code and schemas for working with CML. The code is in Java: do C or C++ equivalents exist?

- Dmitry says:
  
  May 31, 2011 at 7:26 pm
  
  Have a look at OpenBabel.
  
ralf says:

March 1, 2012 at 5:57 pm

I hope this is not too much off topic:
I look for a recompiled file (probably xml) or open database connection to access all the standard properties (radius, typical charge, ionization energy, …) from the elements and small molecules (CO2, H2O, … ) as input for a little program. Does anyone know where to get something like this? If not, would you compile such a file with me? cheers, Ralf

Dmitry says:

March 2, 2012 at 7:05 pm

Hello!
I try to use chemical formulas in Docbook technology.
I’ve found but I couldn’t find any relevant information.
Is it possible to use ChemML in Docbook?
