MMCif behavior when auth_seq_id is missing #775

Open
sbliven opened this issue Jun 15, 2018 · 4 comments

@sbliven (Member) commented Jun 15, 2018

A discussion came up in PR #774 regarding the correct behavior when parsing an mmCif file without auth_seq_id.

BioJava 4.2.11 requires the auth_seq_id column. This is a problem because it is optional according to the spec and omitted by PyMOL.

In b207d34 I added code that uses the label_seq_id column to create the ResidueNumber for each group if auth_seq_id is missing. There was some concern that this could lead to inconsistent residue numbers if some residues used '?' (defaulting to label_) while the rest used the auth_ values specified. In practice this cannot happen at the moment, because of another bug: a '?' in that column causes a NumberFormatException.
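To make the per-residue fallback concrete, here is a minimal sketch in plain Java. It is not the code from b207d34 and does not use any BioJava classes; the class and method names (`SeqIdFallback`, `resolveSeqNum`) are hypothetical. It assumes the CIF placeholders '?' and '.' both mean "no value", and it guards the integer parse so that a placeholder never reaches `Integer.parseInt`, which is where a NumberFormatException would otherwise come from.

```java
import java.util.OptionalInt;

// Hypothetical helper, for illustration only (not part of BioJava).
final class SeqIdFallback {

    // CIF uses '?' for unknown and '.' for inapplicable values.
    static boolean isMissing(String value) {
        return value == null || value.isEmpty()
                || value.equals("?") || value.equals(".");
    }

    // Prefer auth_seq_id; fall back to label_seq_id for this one row.
    static OptionalInt resolveSeqNum(String authSeqId, String labelSeqId) {
        String chosen = isMissing(authSeqId) ? labelSeqId : authSeqId;
        if (isMissing(chosen)) {
            return OptionalInt.empty(); // neither column has a usable value
        }
        try {
            return OptionalInt.of(Integer.parseInt(chosen.trim()));
        } catch (NumberFormatException e) {
            // Non-numeric content: report "no number" instead of failing.
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveSeqNum(null, "12"));  // column absent
        System.out.println(resolveSeqNum("?", "7"));    // placeholder in auth_
        System.out.println(resolveSeqNum("105", "7"));  // auth_ value wins
    }
}
```

Because the decision is made row by row, a file with a partially populated auth_seq_id column would mix auth_ and label_ numbering, which is exactly the inconsistency raised in the PR discussion.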

@josemduarte suggested only doing the label_ fallback if ALL groups have null ResidueNumbers. This is probably the right solution, but it seems like such an edge case it might not be worth the hour it will take to fix it.
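The all-or-nothing variant could look roughly like the sketch below. Again this is plain Java with hypothetical names (`AllOrNothingFallback`, `chooseSeqIds`), not BioJava code, and it assumes each atom row has already been reduced to a two-element array of {auth_seq_id, label_seq_id}. The label_ column is used only when no row at all has an auth_ value; a partially populated column is passed through unchanged so the caller can reject it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of BioJava).
final class AllOrNothingFallback {

    static boolean isMissing(String value) {
        return value == null || value.isEmpty()
                || value.equals("?") || value.equals(".");
    }

    // rows: each element is {auth_seq_id, label_seq_id} for one atom row.
    static List<String> chooseSeqIds(List<String[]> rows) {
        // First pass: does any row carry a real auth_seq_id?
        boolean allAuthMissing = true;
        for (String[] row : rows) {
            if (!isMissing(row[0])) {
                allAuthMissing = false;
                break;
            }
        }
        // Second pass: pick one column for every row, never a mix.
        List<String> chosen = new ArrayList<>();
        for (String[] row : rows) {
            chosen.add(allAuthMissing ? row[1] : row[0]);
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {null, "1"});
        rows.add(new String[] {null, "2"});
        System.out.println(chooseSeqIds(rows));
    }
}
```

The cost is a second pass over the rows (or buffering them), which is presumably part of why this was estimated as extra work for an edge case.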

sbliven added a commit to sbliven/biojava-sbliven that referenced this issue Jun 15, 2018
These tests document the current behavior. See biojava#775 for discussion
about the correct behavior.

@sbliven (Member, Author) commented Jun 15, 2018

683132d contains tests showing that the current code (with #774) handles a missing auth_seq_id column and throws an exception for a partially populated column. Good enough?

@josemduarte (Contributor) commented Jun 15, 2018

Thanks @sbliven, very nice tests! There's one detail I was forgetting: HETATMs have a '.' for their label_seq_id in deposited PDB files.

I've added some HETATMs to your tests and they failed. I have made a fix that should handle that: josemduarte@699af8a
All tests pass with that so I guess we can keep it. I still worry that the HETATMs won't be handled correctly in the edge case where there's no auth_seq_id, but it is an edge case.

Please feel free to pull request all that together if you are happy with that.

sbliven added a commit to sbliven/biojava-sbliven that referenced this issue Jun 18, 2018
Parsing such a file again throws a NumberFormatException.
Further work/discussion of this issue is on biojava#775, but it was blocking
the merging of biojava#774.

@sbliven (Member, Author) commented Jun 18, 2018

I think this can't be properly fixed in 4.* because we need access to the seq_id. This is stored in AminoAcidImpl.getId() and similar, but not available for Group.

I'm going to rebase these tests onto the master branch and do any fixes there. 4.* will continue to throw errors if auth_seq_id is missing or not numeric.

@sbliven sbliven self-assigned this Jun 18, 2018
@sbliven sbliven added this to the BioJava 5.1.0 milestone Jun 18, 2018

@sbliven (Member, Author) commented Jun 18, 2018

A related comment is that auth_seq_id is an arbitrary string, according to the spec. I think that might break the current data model.
