ECOD reliability #758

Open
sbliven opened this issue Mar 21, 2018 · 4 comments

@sbliven sbliven commented Mar 21, 2018

ECOD is currently the most up-to-date classification scheme. Early versions had some issues with illegal residue ranges. Have those been fixed?

We have a slow test, EcodParseTest, that loads all the domains and does some basic sanity checks. I thought I would summarize the results on develop204 as an indicator of its suitability for large-scale application.

The script does the following:

  1. Parse the domain definition
  2. Download the structure
  3. Check that a residue exists for the range start and end. If not, shrink the range to the ordered residues
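
Step 3 can be sketched roughly as follows. This is a minimal illustration, not the actual EcodParseTest or BioJava code: the class `RangeShrinker` and its `shrink` method are hypothetical names, and it assumes plain integer residue numbers that increase along the chain (no insertion codes).

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
class RangeShrinker {
    /**
     * Shrinks [start, end] to the nearest residues actually observed in the
     * structure. Returns empty if no observed residue falls inside the range.
     */
    static Optional<int[]> shrink(TreeSet<Integer> observed, int start, int end) {
        Integer newStart = observed.ceiling(start); // first observed residue >= start
        Integer newEnd = observed.floor(end);       // last observed residue <= end
        if (newStart == null || newEnd == null || newStart > newEnd) {
            return Optional.empty();                // corresponds to an "empty range"
        }
        return Optional.of(new int[] {newStart, newEnd});
    }

    public static void main(String[] args) {
        // Assumed example: residues 1-2 and 9-10 are disordered, 3-8 are observed.
        TreeSet<Integer> observed = new TreeSet<>(Arrays.asList(3, 4, 5, 6, 7, 8));
        Optional<int[]> r = shrink(observed, 1, 10);
        System.out.println(r.map(a -> a[0] + "-" + a[1]).orElse("empty range"));
    }
}
```

With the assumed data in `main`, the annotated range 1-10 is shrunk to 3-8. A range that contains no observed residue at all comes back empty rather than being silently widened.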

I found the following errors. Note that the counts are slightly low due to a network interruption in the middle of the run, but they should be representative.

  1. Can't get range/Start doesn't exist/End doesn't exist: 23329
    • ECOD seems to propagate definitions based on refseq. Thus, if one structure has disordered residues near the termini these may still be annotated as part of the domain. These are not concerning, but should be filtered out in the test to make sure they aren't hiding other rarer issues.
    • BioJava also seems to have a bug parsing insertion codes from ECOD. This needs further BioJava attention to replicate & fix.
  2. Empty range: 51
    • These are more serious, and seem to be real ECOD errors. I looked at one in detail (e1npoA2) and it was a case where a domain had been split into three chains, so that e1nowA1[A:200-552] should have been mapped to e1npoA2[C:200-311,D:316-552]
  3. Error parsing: 10
    • 7 of these had a range of '0'. This seems to be an ECOD error. I looked at one, e5j3dF2. It might be an issue reconstructing the assembly of parent e5mmrB2, which has a very different space group.
    • 1 (e1k1fD1) is a straight-up typo (D:1-66D:1-66)
    • 2 have insertion codes and are BioJava parse errors

Conclusions

  • Automated parse tests can find errors in databases, and we (or ideally Grishin) should run these regularly & report failures
  • Tests are very slow, mostly due to the need to download the whole PDB. This would be a nice target for a mmtf/hadoop/spark testing framework.
  • Most of the major mapping bugs present in early versions have been fixed
  • ECOD is pretty robust now, but a couple of problems remain in more sophisticated cases like assembly mapping and many-to-many mappings
  • BioJava also has bugs in range parsing, despite lots of unit tests in that area

@lafita lafita commented Mar 22, 2018

Errors 2 and 3 seem to be minor and I guess what we can do is to provide better logging when they occur.

Error 1 seems to be the most important, with 23K cases, and I think we should handle it in BioJava, since it is not really an error in the ECOD definitions but a consequence of their protocol for annotating structures. Could we look at the terminal residues and use the closest residue present in the structure as the terminus?
Step 3 of the script seems to be doing exactly that, so how is Error 1 still there?


@sbliven sbliven commented Mar 22, 2018

There are different goals with the errors detected here.

  1. Errors with the biojava parser. I've already fixed one that was contributing to some of the test failures.
  2. Errors that ecod should fix. These we can report upstream, but should continue failing the tests.
  3. Areas where ecod has a different data model than biojava. These should not count as test failures, although we may still want to print warnings or find a workaround.

Error 1 is more of a data model difference. SEQRES residues don't officially have residue numbers, so BioJava just sets the ResidueNumber to null. ECOD seems to interpolate these somehow (e.g. if the PDB numbering starts at 3P, then the preceding two SEQRES residues would be 1P and 2P). BioJava is smart enough to guess the right range most of the time, so it's not really a big issue. Still, it would be nice to be able to exclude such cases from the test, since they could be obscuring the more serious errors.
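
One way the test could keep these three goals apart is to tally failures by category and only fail on the categories that should fail. The sketch below is purely illustrative: `FailureTally` and its `Category` enum are hypothetical names, not part of BioJava, and treating BioJava parser bugs as test failures is an assumption.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical tally, for illustration only.
class FailureTally {
    enum Category {
        BIOJAVA_BUG(true),        // assumption: parser bugs should fail the test
        ECOD_ERROR(true),         // reported upstream, but keeps failing the test
        MODEL_DIFFERENCE(false);  // warn only, e.g. range ends at a disordered residue

        final boolean failsTest;

        Category(boolean failsTest) {
            this.failsTest = failsTest;
        }
    }

    private final Map<Category, Integer> counts = new EnumMap<>(Category.class);

    void record(Category c) {
        counts.merge(c, 1, Integer::sum);
    }

    int count(Category c) {
        return counts.getOrDefault(c, 0);
    }

    /** True if no failure was recorded in a category that should fail the test. */
    boolean testPasses() {
        for (Category c : Category.values()) {
            if (c.failsTest && count(c) > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        FailureTally tally = new FailureTally();
        tally.record(Category.MODEL_DIFFERENCE);
        System.out.println("passes with only model differences: " + tally.testPasses());
        tally.record(Category.ECOD_ERROR);
        System.out.println("passes after an ECOD error: " + tally.testPasses());
    }
}
```

Keeping the counts per category means the data model differences can still be printed as warnings without drowning out the rarer failures.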

sbliven added a commit to sbliven/biojava-sbliven that referenced this issue Mar 22, 2018
Residue ranges with both insertion codes and negative residues
were not being parsed correctly. These were tested individually but
not together.

Also fixes a bug with accepting some invalid ResidueNumber formats.

These issues came up during ECOD parsing (biojava#758)
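
To illustrate the kind of range string involved, here is a minimal regex-based sketch that accepts negative residue numbers and insertion codes together. It is not the code from the commit above: `RangeSketch` and `parse` are hypothetical names, and the grammar (comma-separated `chain:start-end` segments, with a single-letter insertion code) is an assumption based on the examples in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser, for illustration only.
class RangeSketch {
    // chain ':' start '-' end, where a residue is an optional minus sign,
    // digits, and an optional single-letter insertion code (assumed grammar).
    private static final Pattern SEGMENT =
            Pattern.compile("([^:,]+):(-?\\d+[A-Za-z]?)-(-?\\d+[A-Za-z]?)");

    /** Splits a comma-separated range into {chain, start, end} triples. */
    static List<String[]> parse(String range) {
        List<String[]> segments = new ArrayList<>();
        for (String part : range.split(",")) {
            Matcher m = SEGMENT.matcher(part.trim());
            if (!m.matches()) { // the whole segment must match, not just a prefix
                throw new IllegalArgumentException("Malformed range segment: '" + part + "'");
            }
            segments.add(new String[] {m.group(1), m.group(2), m.group(3)});
        }
        return segments;
    }

    public static void main(String[] args) {
        for (String[] s : parse("C:200-311,D:316-552")) {
            System.out.println(s[0] + " " + s[1] + " " + s[2]);
        }
        // Negative start with an insertion code, negative end.
        System.out.println(String.join(" ", parse("A:-5A--1").get(0)));
    }
}
```

Because each segment must match in full, inputs like `D:1-66D:1-66` or a bare `0` are rejected with an exception instead of being partially parsed.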

@altaite altaite commented Mar 22, 2018

> ECOD is currently the most up-to-date classification scheme.

Is it? CATH seems to have daily snapshots (but their server seems slow). In case anybody is interested in CATH, I have a parser (mostly an object model of their domains; each level is separate and I do not build the whole tree).


@sbliven sbliven commented Mar 22, 2018

@altaite I knew I should have avoided comparing classifications! Let's just say that CATH and ECOD (and SCOPe!) are all excellent classifications and the field is stronger for having some diversity.

Note that BioJava has supported CATH releases for years, although it looks like we might have to do some minor refactoring to get it to pull from the daily snapshots.
