ECOD reliability #758

sbliven · 2018-03-21T16:02:14Z

ECOD is currently the most up-to-date classification scheme. Early versions had some issues with illegal residue ranges. Have those been fixed?

We have a slow test, EcodParseTest, that loads all the domains and does some basic sanity checks. I thought I would summarize the results on develop204 as an indicator of its suitability for large-scale application.

The script does the following:

1. Parse the domain definition
2. Download the structure
3. Check that a residue exists for the range start and end. If not, shrink the range to ordered residues
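The shrinking in step 3 can be illustrated with a minimal sketch. This is not the code used by EcodParseTest; the function name `shrink_to_observed` and the representation of residues as `(number, insertion_code)` tuples are assumptions made for illustration. It also assumes residue numbering increases monotonically along the chain, which real PDB entries do not guarantee.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (sequence number, insertion code or "")
Residue = Tuple[int, str]


def shrink_to_observed(
    observed: List[Residue], start: Residue, end: Residue
) -> Optional[Tuple[Residue, Residue]]:
    """Return the largest sub-range of [start, end] whose endpoints are observed.

    Assumes residue numbers increase monotonically along the chain,
    so tuple comparison orders residues correctly.
    """
    inside = [r for r in observed if start <= r <= end]
    if not inside:
        # No ordered residue falls inside the annotated range.
        return None
    return inside[0], inside[-1]


# The domain is annotated as 1-5A, but residues 1 and 2 are disordered.
observed = [(3, ""), (4, ""), (5, ""), (5, "A"), (6, "")]
shrunk = shrink_to_observed(observed, (1, ""), (5, "A"))
```

Here `shrunk` covers residues 3 through 5A. A `None` result corresponds to a range with no ordered residues at all.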

I find the following errors. Note that the counts are slightly low due to a network interruption in the middle of the run, but they should be representative.

1. Can't get range/Start doesn't exist/End doesn't exist: 23329
   - ECOD seems to propagate definitions based on refseq. Thus, if one structure has disordered residues near the termini, these may still be annotated as part of the domain. These are not concerning, but should be filtered out in the test to make sure they aren't hiding other, rarer issues.
   - BioJava also seems to have a bug parsing insertion codes from ECOD. This needs further work on the BioJava side to reproduce and fix.
2. Empty range: 51
   - These are more serious, and seem to be real ECOD errors. I looked at one in detail (e1npoA2) and it was a case where a domain had been split into three chains, so that e1nowA1[A:200-552] should have been mapped to e1npoA2[C:200-311,D:316-552].
3. Error parsing: 10
   - 7 of these had a range of '0'. This seems to be an ECOD error. I looked at one, e5j3dF2. It might be an issue reconstructing the assembly of parent e5mmrB2, which has a very different space group.
   - 1 (e1k1fD1) is a straight-up typo (D:1-66D:1-66)
   - 2 have insertion codes and are BioJava parse errors
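A tally like the one above can be reproduced on range strings alone, before any structure is downloaded. The sketch below is only an illustration: the category names, the `classify` function and the regular expression are assumptions, not part of ECOD or BioJava, and the sample strings are the ones quoted in this report.

```python
import re
from collections import Counter

# One chain segment such as "A:200-552", "A:-5-10" or "A:27A-30B".
# This pattern is an assumption about the range syntax, not an official grammar.
SEGMENT = re.compile(r"[A-Za-z0-9]+:-?\d+[A-Za-z]?--?\d+[A-Za-z]?")


def classify(range_str: str) -> str:
    """Sort a range string into a coarse category for reporting."""
    text = range_str.strip()
    if text in ("", "0"):
        # A bare '0' carries no chain or residue information.
        return "placeholder"
    if all(SEGMENT.fullmatch(part.strip()) for part in text.split(",")):
        return "ok"
    return "malformed"


samples = ["A:200-552", "C:200-311,D:316-552", "0", "D:1-66D:1-66"]
tally = Counter(classify(s) for s in samples)
```

The duplicated range `D:1-66D:1-66` is caught because the second segment is not separated by a comma, so the whole string fails to match a single segment.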

Conclusions

- Automated parse tests can find errors in databases, and we (or ideally Grishin) should run these regularly and report failures.
- Tests are very slow, mostly due to the need to download the whole PDB. This would be a nice target for an MMTF/Hadoop/Spark testing framework.
- Most of the major mapping bugs present in early versions have been fixed.
- ECOD is pretty robust now, but a couple of problems remain in complex cases such as assembly mapping and many-to-many mappings.
- BioJava also has bugs in range parsing, despite lots of unit tests in that area.

lafita · 2018-03-22T11:16:02Z

Errors 2 and 3 seem to be minor, and I guess what we can do is provide better logging when they occur.

Error 1 seems to be the most important, with 23K cases, and I think we should handle it in BioJava, since it is not really an error in the ECOD definitions but a consequence of their protocol for annotating structures. Could we look at the terminal residues and, for each, use the closest residue included in the structure as the terminus?
Step 3 of the script seems to be doing exactly that, so why is Error 1 still there?

sbliven · 2018-03-22T18:58:11Z

There are different goals for the different kinds of errors detected here.

1. Errors in the BioJava parser. I've already fixed one that was contributing to some of the test failures.
2. Errors that ECOD should fix. These we can report upstream, but they should continue failing the tests.
3. Areas where ECOD has a different data model than BioJava. These should not count as test failures, although we may still want to print warnings or find a workaround.

Error 1 is more of a data model difference. SEQRES residues don't officially have residue numbers. BioJava just sets the ResidueNumber to null. ECOD seems to interpolate these somehow (e.g. if the PDB starts at 3P then the preceding two SEQRES residues should be 1P and 2P). BioJava is smart enough to guess the right range most of the time, so it's not really a big issue. Still, it would be nice to be able to exclude such cases from the test, since they could be obscuring the more serious errors.


Fix ResidueRange.parse bug

Residue ranges with both insertion codes and negative residues were not being parsed correctly. These were tested individually but not together. Also fixes a bug with accepting some invalid ResidueNumber formats. These issues came up during ECOD parsing (biojava#758)
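To illustrate the combined case described in this commit message, here is a minimal sketch of a range parser that accepts negative residue numbers and insertion codes together and rejects malformed residue numbers. It is not BioJava's `ResidueRange.parse`; the names `ResNum`, `RANGE` and `parse_range` and the exact pattern are assumptions for illustration only.

```python
import re
from typing import NamedTuple, Optional, Tuple


class ResNum(NamedTuple):
    number: int            # may be negative
    icode: Optional[str]   # insertion code, or None if absent


# chain ":" start "-" end, where start and end are an optional minus sign,
# digits, and at most one insertion-code letter.
RANGE = re.compile(
    r"(?P<chain>[A-Za-z0-9]+):"
    r"(?P<s>-?\d+)(?P<si>[A-Za-z]?)"
    r"-"
    r"(?P<e>-?\d+)(?P<ei>[A-Za-z]?)"
)


def parse_range(text: str) -> Tuple[str, ResNum, ResNum]:
    """Parse a single-chain range such as 'A:-10A--2B'."""
    m = RANGE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"invalid residue range: {text!r}")
    start = ResNum(int(m["s"]), m["si"] or None)
    end = ResNum(int(m["e"]), m["ei"] or None)
    return m["chain"], start, end


# Negative numbers and insertion codes in the same range.
chain, start, end = parse_range("A:-10A--2B")
```

Because the pattern must match the whole string, a residue number with two trailing letters such as `A:5AA-10` raises `ValueError` instead of being silently accepted.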

altaite · 2018-03-22T19:10:50Z

> ECOD is currently the most up-to-date classification scheme.

Is it? CATH seems to have daily snapshots (but their server seems slow). In case anybody is interested in CATH, I have a parser (mostly an object model of their domains; each level is separate, and I do not build the whole tree).

sbliven · 2018-03-22T20:01:41Z

@altaite I knew I should have avoided comparing classifications! Let's just say that CATH and ECOD (and SCOPe!) are all excellent classifications and the field is stronger for having some diversity.

Note that BioJava has supported CATH releases for years, although it looks like we might have to do some minor refactoring to get it to pull from the daily snapshots.

sbliven self-assigned this Mar 21, 2018

sbliven added question minor low priority testing labels Mar 21, 2018

sbliven mentioned this issue Mar 22, 2018

Refactor CathInstallation to support daily snapshots #759

Open
