Anyone looking for a consultant/contract programmer with 20+ years of experience in software development for cheminformatics? I'm available. Expert Python developer. Sweden-based. Some projects I've worked on are chemfp, mmpdb, RDKit's MCS algorithm, OpenSMILES, and VMD.
Just read a recent fp paper where the authors did not implement their core algorithm correctly. I emailed them the corrected version, then pointed out the one in RDKit is 60x faster. Self-correcting science, stepping on toes, or both? Don't feel like naming/shaming.
@rguha Which of the cheminf packages you use have them?
I know OEChem has some support.
If it's info you can make public, do your in-house code bases generally have them?
How important/useful are Python type annotations to #cheminformatics programmers? I'm toying with the idea of adding them to chemfp.
Are they useful in this field yet? Neither RDKit nor Open Babel supports them, so code using those libs won't type check well, as I understand it.
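For anyone unsure what annotations would look like on a fingerprint API, here's a minimal sketch. `popcount` and `tanimoto` are hypothetical helpers written for illustration, not chemfp functions, and they assume fingerprints are plain byte strings of equal length.

```python
def popcount(fp: bytes) -> int:
    """Count the number of 1 bits in a byte-string fingerprint."""
    return bin(int.from_bytes(fp, "big")).count("1")


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity of two equal-length byte-string fingerprints."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    # Bitwise AND gives the shared bits, bitwise OR gives the union.
    shared = bytes(a & b for a, b in zip(fp1, fp2))
    union = bytes(a | b for a, b in zip(fp1, fp2))
    union_count = popcount(union)
    # Convention assumed here: two all-zero fingerprints score 0.0.
    return popcount(shared) / union_count if union_count else 0.0
```

With signatures like these a type checker can flag a call such as `tanimoto("0f", b"\x03")` before the code runs. Values returned by an unannotated toolkit would not get that kind of check.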
Chemfp (https://t.co/ldew8zh8sC) is a Python package for fingerprint generation (using one of four cheminf toolkits), fast similarity search, diversity selection, and clustering. It has command-line tools and an extensive Python API. Try it out! #cheminformatics
Getting ready for chemfp 4.1. Beta 1 out now. Can save/load a similarity matrix in SciPy's sparse npz format.
Added Butina clustering. How do thresholds affect the result? Pre-compute the matrix at the lowest threshold and (re)cluster at other, higher ones. Takes seconds for ChEMBL.
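A rough sketch of the recluster-at-higher-thresholds idea, in pure Python. This is not chemfp's implementation or API: the `butina` function, the `(i, j, score)` triple format, and the toy scores are all assumptions for illustration. It uses the simple variant where neighbor counts are computed once and centers are taken in decreasing count order.

```python
from collections import defaultdict


def butina(num_items, pairs, threshold):
    """Cluster items 0..num_items-1 from precomputed similarity pairs.

    `pairs` holds (i, j, score) triples computed once at the lowest
    threshold of interest. Pairs scoring below `threshold` are ignored,
    so the same list can be reused for any higher threshold.
    """
    neighbors = defaultdict(set)
    for i, j, score in pairs:
        if score >= threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # Visit candidates with the most neighbors first; break ties by index.
    order = sorted(range(num_items), key=lambda i: (-len(neighbors[i]), i))

    assigned = set()
    clusters = []
    for center in order:
        if center in assigned:
            continue
        # The cluster is the center plus its still-unassigned neighbors.
        members = [center] + sorted(neighbors[center] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Toy similarity "matrix" precomputed once at threshold 0.7.
pairs = [(0, 1, 0.9), (0, 2, 0.75), (1, 2, 0.8), (2, 3, 0.71), (3, 4, 0.72)]
for threshold in (0.7, 0.8):
    print(threshold, butina(6, pairs, threshold))
```

The expensive step, computing the pairs, happens once. Each reclustering is only a filter plus a sort, which is why trying several thresholds is cheap.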
@rapodaca But knowing that MACCS supported ASCII (and not EBCDIC) doesn't mean that it couldn't support extended ASCII. A client with a DEC VAX and DEC VT200 terminals wouldn't have encoding issues, and it could be a practice MDL condoned.
@rapodaca Just like how VMD & other tools read and write "PDB" files which 1) *aren't* compatible with the PDB spec, and 2) *are* compatible with other PDB tools.
And again, those Latin-1 SDFile days are long gone - modern tools expect UTF-8.
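If you do run into one of those old files, one pragmatic approach is to try UTF-8 first and fall back to Latin-1. A minimal sketch: `decode_sdf_text` is a hypothetical helper, not part of any toolkit, and the fallback is only a guess about where the file came from.

```python
def decode_sdf_text(raw: bytes) -> str:
    """Decode SD file bytes, preferring UTF-8 and falling back to Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails.
        # It can silently produce the wrong text if the file actually
        # used some other 8-bit encoding.
        return raw.decode("latin-1")
```

Pure ASCII input decodes identically either way, so only records with bytes above 0x7F are affected.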
@rapodaca MACCS was always designed with graphical input/display in mind, starting with the GT40. https://t.co/svix6uaVSp
The two ACS meeting citations do not appear to be findable papers - I assume they were oral presentations only.
@rapodaca Hardware requirements for MACCS-II, ca. 1990. From https://t.co/HlzFBqANSY . Note the "terminal-independent graphics library." All the terminals I looked at (VT200, Tek 4105) support an extended ASCII.
That doesn't mean CTfiles allowed non-ASCII, only that it's not precluded.
@rapodaca Thing is, "ASCII" is also a synecdoche for "text" or "human-readable". E.g., the Wikipedia entry for ASCII notes the 8-bit "encodings are sometimes referred to as ASCII", though they are not "true ASCII".
Was MACCS 8-bit clean, with the onus on the user to handle encoding issues?
Here's the earliest (partial) published description of the (in)famous MACCS keys I've found. Pp 242-243 of Kasparek's 1990 "Computer graphics and chemical structures". https://t.co/tAg0TIDCHb
It gives 4 of the 165 (?!) keys which can be used for inverted structure search.