ECOD reliability #758
Errors 2 and 3 seem to be minor, and I guess what we can do is provide better logging when they occur. Error 1 seems to be the most important, with 23K cases, and I think we should handle it in BioJava, since it is not really an error in the ECOD definitions but a consequence of their protocol for annotating structures. Could we look at the terminal residues and use the closest residue that is included in the structure as the terminus?
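To make the suggestion concrete, here is a minimal sketch of snapping a domain's termini to the nearest residues that are actually present in the structure. It is plain Java rather than BioJava code: the class name `TerminiSketch`, the method `snap`, and the use of bare integers for residue numbers (ignoring insertion codes) are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.TreeSet;

public class TerminiSketch {
    /**
     * Clamps the range [start, end] to residues present in `observed`.
     * Hypothetical helper: residue numbers are plain ints here, so
     * insertion codes are ignored.
     * Returns {newStart, newEnd}, or null if no observed residue lies in the range.
     */
    public static int[] snap(TreeSet<Integer> observed, int start, int end) {
        Integer s = observed.ceiling(start); // first observed residue at or after start
        Integer e = observed.floor(end);     // last observed residue at or before end
        if (s == null || e == null || s > e) {
            return null;
        }
        return new int[] { s, e };
    }

    public static void main(String[] args) {
        // The structure only contains residues 3..8; the domain is annotated as 1-10.
        TreeSet<Integer> observed = new TreeSet<>(Arrays.asList(3, 4, 5, 6, 7, 8));
        System.out.println(Arrays.toString(snap(observed, 1, 10)));
    }
}
```

Moving the start forward and the end backward keeps the adjusted range inside the annotated domain; whether that is the right policy for ECOD ranges is an open question.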
There are different goals with the errors detected here.
Error 1 is more of a data model difference. SEQRES residues don't officially have residue numbers, so BioJava just sets the ResidueNumber to null. ECOD seems to interpolate these somehow (e.g. if the PDB file starts at 3P, then the preceding two SEQRES residues should be 1P and 2P). BioJava is smart enough to guess the right range most of the time, so it's not really a big issue. Still, it would be nice to be able to exclude such cases from the test, since they could be obscuring the more serious errors.
Residue ranges with both insertion codes and negative residues were not being parsed correctly. These were tested individually but not together. Also fixes a bug with accepting some invalid ResidueNumber formats. These issues came up during ECOD parsing (biojava#758)
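As an illustration of why the combined case is tricky, here is a minimal regex sketch in which the hyphen serves both as the range separator and as a minus sign. This is not the BioJava implementation: the class `RangeSketch`, the method `parse`, and the accepted `chain:start-end` syntax are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RangeSketch {
    // chain ':' start '-' end, where each residue number may be negative
    // and may carry a single-letter insertion code (assumed syntax).
    private static final Pattern RANGE = Pattern.compile(
            "^([A-Za-z0-9]+):(-?\\d+)([A-Za-z]?)-(-?\\d+)([A-Za-z]?)$");

    /**
     * Returns {chain, startNumber, startInsCode, endNumber, endInsCode},
     * or null if the text is not a valid range under the assumed syntax.
     */
    public static String[] parse(String text) {
        Matcher m = RANGE.matcher(text.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4), m.group(5) };
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", parse("D:1-66")));    // simple range
        System.out.println(String.join(",", parse("A:-5A--1B"))); // negative numbers with insertion codes
        System.out.println(parse("A:5AA-10"));                     // invalid: two-letter insertion code
    }
}
```

Anchoring the pattern with `^` and `$` is what rejects malformed residue numbers such as `5AA` instead of silently accepting a prefix.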
Is it? CATH seems to have daily snapshots (but their server seems slow). In case anybody is interested in CATH, I have a parser (mostly an object model of their domains; each level is separate, and I do not build the whole tree).
@altaite I knew I should have avoided comparing classifications! Let's just say that CATH and ECOD (and SCOPe!) are all excellent classifications and the field is stronger for having some diversity. Note that BioJava has supported CATH releases for years, although it looks like we might have to do some minor refactoring to get it to pull from the daily snapshots.
ECOD is currently the most up-to-date classification scheme. Early versions had some issues with illegal residue ranges. Have those been fixed?
We have a slow test, EcodParseTest, that loads all the domains and does some basic sanity checks. I thought I would summarize the results on develop204 as an indicator of its suitability for large-scale application.
The script does the following:
I find the following errors. Note that the counts are slightly low due to a network interruption partway through the run, but they should be representative.
D:1-66D:1-66)
Conclusions