MMCif behavior when auth_seq_id is missing #775

sbliven · 2018-06-15T10:23:10Z

A discussion came up in PR #774 regarding the correct behavior when parsing an mmCIF file that lacks the auth_seq_id column.

BioJava 4.2.11 requires the auth_seq_id column. This is a problem because the column is optional according to the spec, and PyMOL omits it.

In b207d34 I added code that falls back to the label_seq_id column when creating the ResidueNumber for each group if auth_seq_id is missing. There was some concern that this could lead to inconsistent residue numbers if some residues used '?' (defaulting to label_) while the rest used the auth_ values specified. In practice this cannot currently happen because of another bug: a '?' in that column causes a NumberFormatException.
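To make the per-row fallback concrete, here is a minimal sketch. It is not the code from b207d34: the class and method names are hypothetical, and it assumes the column values arrive as plain strings. It treats '?' and '.' as missing instead of passing them to Integer.parseInt, which is where a NumberFormatException would otherwise come from.

```java
import java.util.OptionalInt;

/** Hypothetical helper, not part of BioJava: picks a residue number from two mmCIF columns. */
final class SeqIdFallback {

    /** mmCIF writes "?" for unknown values and "." for inapplicable ones. */
    static boolean isMissing(String value) {
        return value == null || value.equals("?") || value.equals(".");
    }

    /** Parses a column value; returns empty instead of throwing on missing or non-numeric input. */
    static OptionalInt parseSeqId(String value) {
        if (isMissing(value)) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(value.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    /** Prefers auth_seq_id; uses label_seq_id only when the auth value is unusable. */
    static OptionalInt residueNumber(String authSeqId, String labelSeqId) {
        OptionalInt auth = parseSeqId(authSeqId);
        return auth.isPresent() ? auth : parseSeqId(labelSeqId);
    }

    public static void main(String[] args) {
        System.out.println(residueNumber("42", "7"));  // auth value is used
        System.out.println(residueNumber("?", "7"));   // falls back to the label value
        System.out.println(residueNumber(null, "."));  // nothing usable in either column
    }
}
```

Because the decision is made row by row, a file with a partially populated auth_seq_id column would end up with a mix of auth_ and label_ numbers, which is exactly the inconsistency that was raised as a concern.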

@josemduarte suggested only doing the label_ fallback if ALL groups have null ResidueNumbers. This is probably the right solution, but it seems like such an edge case that it might not be worth the hour it would take to fix.
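The all-or-nothing variant could look roughly like the sketch below. The names are hypothetical and the two columns are assumed to be available as parallel lists of strings, which is a simplification of how a parser would actually see them.

```java
import java.util.List;

/** Hypothetical helper, not part of BioJava: chooses one numbering column for the whole file. */
final class AllOrNothingFallback {

    /** mmCIF writes "?" for unknown values and "." for inapplicable ones. */
    static boolean isMissing(String value) {
        return value == null || value.equals("?") || value.equals(".");
    }

    /**
     * Returns the label_seq_id values only when every auth_seq_id value is missing.
     * A partially populated auth column is returned unchanged, so the caller can
     * decide whether to reject it.
     */
    static List<String> chooseNumbering(List<String> authSeqIds, List<String> labelSeqIds) {
        if (authSeqIds.size() != labelSeqIds.size()) {
            throw new IllegalArgumentException("columns must have the same number of rows");
        }
        boolean allMissing = authSeqIds.stream().allMatch(AllOrNothingFallback::isMissing);
        return allMissing ? labelSeqIds : authSeqIds;
    }
}
```

Choosing the column once per file means the numbering is never mixed: either every residue uses auth_ values or every residue uses label_ values.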


Add tests for missing auth_seq_id in mmcif files

These tests document the current behavior. See biojava#775 for discussion about the correct behavior.

sbliven · 2018-06-15T10:30:01Z

683132d contains tests showing that the current code (with #774) handles a missing auth_seq_id column and throws an exception for a partially populated column. Good enough?

josemduarte · 2018-06-15T18:13:31Z

Thanks @sbliven, very nice tests! There's one detail I was forgetting: HETATMs have a '.' for their label_seq_id in deposited PDB files.

I've added some HETATMs to your tests and they failed. I have made a fix that should handle that: josemduarte@699af8a
All tests pass with that so I guess we can keep it. I still worry that the HETATMs won't be handled correctly in the edge case where there's no auth_seq_id, but it is an edge case.

Please feel free to pull request all that together if you are happy with that.
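The HETATM edge case can be illustrated with a small sketch. The row class is hypothetical and is not the fix in josemduarte@699af8a; it only shows which rows would be left without any usable number when auth_seq_id is absent and label_seq_id is '.'.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical row model; the fields mirror the _atom_site columns discussed in this issue. */
final class AtomSiteRow {
    final String groupPdb;    // "ATOM" or "HETATM"
    final String labelSeqId;  // "." for HETATM rows in deposited files
    final String authSeqId;   // null when the column is absent from the file

    AtomSiteRow(String groupPdb, String labelSeqId, String authSeqId) {
        this.groupPdb = groupPdb;
        this.labelSeqId = labelSeqId;
        this.authSeqId = authSeqId;
    }

    /** mmCIF writes "?" for unknown values and "." for inapplicable ones. */
    static boolean isMissing(String value) {
        return value == null || value.equals("?") || value.equals(".");
    }

    /** True when at least one of the two columns could supply a residue number. */
    boolean hasUsableNumber() {
        return !isMissing(authSeqId) || !isMissing(labelSeqId);
    }

    /** Collects the rows that neither column can number. */
    static List<AtomSiteRow> unnumbered(List<AtomSiteRow> rows) {
        List<AtomSiteRow> result = new ArrayList<>();
        for (AtomSiteRow row : rows) {
            if (!row.hasUsableNumber()) {
                result.add(row);
            }
        }
        return result;
    }
}
```

For a file without auth_seq_id, every HETATM row whose label_seq_id is '.' ends up in the unnumbered list, so a label_ fallback alone cannot assign it a residue number.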


Revert fix for mmcif files without auth_seq_id column

Parsing such a file again throws a NumberFormatException. Further work/discussion of this issue is on biojava#775, but it was blocking the merging of biojava#774.

sbliven · 2018-06-18T10:01:06Z

I think this can't be properly fixed in 4.* because we need access to the seq_id. This is stored in AminoAcidImpl.getId() and similar, but not available for Group.

I'm going to rebase these tests onto the master branch and do any fixes there. 4.* will continue to throw errors if auth_seq_id is missing or not numeric.

sbliven · 2018-06-18T10:03:18Z

A related comment is that auth_seq_id is an arbitrary string, according to the spec. I think that might break the current data model.

sbliven added a commit to sbliven/biojava-sbliven that referenced this issue Jun 15, 2018

Add tests for missing auth_seq_id in mmcif files

683132d

These tests document the current behavior. See biojava#775 for discussion about the correct behavior.

sbliven mentioned this issue Jun 15, 2018

Fix #703: Recover from empty structure files in PDB_CACHE_DIR #774

Merged

sbliven self-assigned this Jun 18, 2018

sbliven added this to the BioJava 5.1.0 milestone Jun 18, 2018

josemduarte modified the milestones: BioJava 5.2.0, BioJava 6.0.0 Sep 2, 2019
