Treat missing repository networks, gists, and storage objects as a non-critical error #386

Merged
snh merged 23 commits into master from snh/ignore-all-missing on May 14, 2018

@snh
Member

@snh snh commented Apr 1, 2018

With the recent move to the cluster backup methodology for the single instance and high availability architectures, the backup process has become less tolerant of data consistency issues.

In particular, if a repository network, gist, or storage object has a route in the database, but cannot be found on disk, the backup will abort, leaving an unusable backup snapshot.

While such errors are serious and warrant investigation with assistance from GitHub Enterprise Support, I propose that we continue with the remainder of the backup regardless and present a warning about these routes. This can be accomplished using the --ignore-missing-args command-line argument for rsync.

To detect these inconsistencies in a more parsable manner, a Verifying Routes process has been added to ghe-backup-repositories and ghe-backup-storage. This compares the routes that were received from GitHub Enterprise with the data that ended up in the backup snapshot, and reports any differences as a warning, along with a recommendation to contact GitHub Enterprise Support.
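A minimal sketch of that comparison, assuming both route lists are plain files with one route per line. The function name `verify_routes` and its arguments are hypothetical, and it uses `comm` rather than the `diff` used in this PR so that only routes missing from the snapshot are listed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare the routes the appliance reported with the
# routes that actually landed in the snapshot. Both files hold one route per line.
verify_routes() {
  local source_routes=$1 destination_routes=$2
  local missing
  # comm -23 prints lines found only in the first (sorted) input, i.e. routes
  # that were expected but are absent from the snapshot.
  missing=$(comm -23 <(sort -u "$source_routes") <(sort -u "$destination_routes"))
  if [ -n "$missing" ]; then
    echo "Warning: the following routes were not found in the snapshot:"
    echo "$missing"
    return 1
  fi
  return 0
}
```

A caller would print the support recommendation when the function returns non-zero, while still letting the backup finish.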

As --ignore-missing-args was only introduced in rsync 3.1.0, if an earlier version is in use I instead catch exits with an exit code of 23, which is returned for a number of errors including the rsync: link_stat <file> failed: No such file or directory (2) error.
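The choice between the two behaviours could be sketched roughly as follows. `missing_args_strategy` is a hypothetical name, and the help text is passed in as an argument purely so the logic can be exercised without a particular rsync version installed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide how to handle missing source paths, given the
# text printed by `rsync -h`. Prints "flag" when --ignore-missing-args is
# available and "catch-23" when exit code 23 has to be tolerated instead.
missing_args_strategy() {
  local help_text=$1
  # -F treats the pattern as a fixed string; -e stops grep from reading the
  # leading dashes as one of its own options.
  if printf '%s\n' "$help_text" | grep -F -e '--ignore-missing-args' >/dev/null; then
    echo "flag"
  else
    echo "catch-23"
  fi
}

# Example call against the local rsync, if one is installed:
# missing_args_strategy "$(rsync -h 2>&1)"
```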

In the future I'd like to swap the use of diff out for git diff -- <file1> <file2>. This is dependent on git becoming mandatory for the backup utilities. Edit: Done.

/cc @github/backup-utils @jatoben

snh added 4 commits Mar 27, 2018
Simulates an appliance with a network, gist, and storage object
that exists in the database, but not on disk.
@lildude
Member

@lildude lildude commented Apr 1, 2018

As --ignore-missing-args was only introduced in rsync 3.1.0, if an earlier version is in use I instead catch exits with an exit code of 23, which is returned for a number of errors including the rsync: link_stat <file> failed: No such file or directory (2) error.

Looks like this needs work as it appears to be what is failing on the macOS build:

 * Transferring auxiliary files ...
    rsync: --ignore-missing-args: unknown option
    rsync error: syntax or usage error (code 1) at /BuildRoot/Library/Caches/com.apple.xbs/Sources/rsync/rsync-51/rsync/main.c(1337) [client=2.6.9]

In the future I'd like to swap the use of diff out for git diff -- <file1> <file2>. This is dependent on git becoming mandatory for the backup utilities.

This is already a documented requirement so you could go this route and have the normal diff as a fallback.
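A rough sketch of that fallback, assuming the two route files are compared outside of any repository. `compare_files` is a hypothetical name, and `--no-index` is spelled out here even though the form quoted above omits it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prefer git's diff when git is installed, otherwise
# fall back to plain diff. Both commands exit non-zero when the files differ.
compare_files() {
  if command -v git >/dev/null 2>&1; then
    # --no-index compares two paths on disk that are not tracked by a repository.
    git --no-pager diff --no-index -- "$1" "$2"
  else
    diff "$1" "$2"
  fi
}
```

Because both branches share the same exit-status convention, the `|| echo "Warning: ..."` pattern used in the backup scripts would work unchanged with either tool.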

More comments coming in my review.

Member

@lildude lildude left a comment

I agree we need to allow the backup to continue though I fear we will be significantly increasing the perceived backup time with the additional verification step for all backups.

How about not implementing the --ignore-missing-args flag at all (removes the version dependency), and instead catch error 23 and use it to warn of a problem. We can then put the verification step behind a debug env var (see the ghe_debug() function) so it's only run when we're trying to work out the differences?

My inline comments are moot if we go with this suggestion 😄

cat $tempdir/*.rsync | sort | uniq > $tempdir/source_routes
(cd $backup_dir/ && find * -mindepth 6 -maxdepth 7 -name \*.git -exec dirname {} \; | sort | uniq) > $tempdir/destination_routes

diff $tempdir/source_routes $tempdir/destination_routes || echo "Warning: One or more repository networks and/or gists were not found on the source appliance. Please contact GitHub Enterprise Support for assistance."

@lildude

lildude Apr 1, 2018
Member

Several points:

  • I think we should only show the difference output when a verbose backup is performed.

  • Do you think we should also force the backup to exit non-zero if this difference is found so it doesn't go completely under the radar? ie allow the rsync to continue but still "fail" the backup step.

  • Is this going to have a major impact on the backup time, particularly for larger datasets?

@snh

snh Apr 3, 2018
Author Member

I think we should only show the difference output when a verbose backup is performed.

As this information will be needed by Enterprise Support, I'd rather leave it in the standard output to prevent the need to re-run the backup with the -v flag. As this output forms part of the warning, I wouldn't consider it informational.

Do you think we should also force the backup to exit non-zero if this difference is found so it doesn't go completely under the radar? ie allow the rsync to continue but still "fail" the backup step.

This seems like a sensible approach, and would certainly add additional visibility. I'll get this added in.

Is this going to have a major impact on the backup time, particularly for larger datasets?

I don't expect it will have a significant impact, as I have tried to keep the comparison as lean and efficient as possible, but I do have a large backup archive here that I plan to use for some benchmarking.

@snh

snh Apr 3, 2018
Author Member

This seems like a sensible approach, and would certainly add additional visibility. I'll get this added in.

On second thoughts, this probably isn't ideal, as it leaves the snapshot marked as incomplete, which will cause it to get pruned on the next successful backup.

@lildude

lildude Apr 3, 2018
Member

On second thoughts, this probably isn't ideal, as it leaves the snapshot marked as incomplete, which will cause it to get pruned on the next successful backup.

Ah, good point.


# Remove --ignore-missing-args argument if using an earlier version of rsync.
if [[ $GHE_EXTRA_RSYNC_OPTS == *"--ignore-missing-args"* ]] && ! rsync -h | grep -q '\-\-ignore-missing-args'; then
  GHE_EXTRA_RSYNC_OPTS=${GHE_EXTRA_RSYNC_OPTS//--ignore-missing-args/}

@lildude

lildude Apr 1, 2018
Member

This will need to be extended to strip --ignore-missing-args from the list of arguments passed in too, otherwise we hit the failure seen in the current macOS builds.

@snh

snh Apr 3, 2018
Author Member

Good catch, I was looking for --ignore-missing-args in the wrong place!

snh added 5 commits Apr 3, 2018
--ignore-missing-args is being passed through as a parameter, and not as
part of $GHE_EXTRA_RSYNC_OPTS.
The symlink may already be there from a previous run.
@snh
Member Author

@snh snh commented Apr 3, 2018

This is already a documented requirement so you could go this route and have the normal diff as a fallback.

PRs such as #362 and #378 gave me the impression we were trying to avoid a hard requirement for git?

How about not implementing the --ignore-missing-args flag at all (removes the version dependency), and instead catch error 23 and use it to warn of a problem.

My concern with this approach is that exit code 23 (partial transfer due to error) appears to cover a number of error states, some of which we don't want to ignore or silence.

@lildude
Member

@lildude lildude commented Apr 3, 2018

PRs such as #362 and #378 gave me the impression we were trying to avoid a hard requirement for git?

Yup, that's why I suggested keeping the normal diff as a fallback, but I don't mind either way.

My concern with this approach is that exit code 23 (partial transfer due to error) appears to cover a number of error states, some of which we don't want to ignore or silence.

Hmmm, but aren't we going to be doing this for those backup hosts not running a new enough version of rsync?


# Check the output for the warnings
cat "$TRASHDIR/backup-out" | grep "Warning: One or more repository networks and/or gists were not found on the source appliance."
cat "$TRASHDIR/backup-out" | grep "Warning: One or more storage objects were not found on the source appliance."

@lildude

lildude Apr 3, 2018
Member

Think we should test to confirm the diff output is here too?

@snh
Member Author

@snh snh commented Apr 3, 2018

Hmmm, but aren't we going to be doing this for those backup hosts not running a new enough version of rsync?

Indeed, though I am hopeful that most users will be running a more recent version of rsync. The route verification should help safeguard us against incomplete backups too.
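To make the trade-off concrete, here is a minimal sketch of the fallback, assuming the wrapper records in a variable that it had to strip the flag. The function name `normalize_rsync_status` is hypothetical; the handling of 24 and 23 follows the shell trace posted later in this thread.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map rsync's exit status to the status the wrapper reports.
# $1: rsync's exit status
# $2: non-empty when --ignore-missing-args had to be stripped for an old rsync
normalize_rsync_status() {
  local res=$1 ignore23=${2:-}
  # 24: some source files vanished during the transfer; treated as success.
  if [ "$res" = "24" ]; then
    res=0
  fi
  # 23: partial transfer due to error; tolerated only on the fallback path,
  # where it stands in for the missing --ignore-missing-args behaviour.
  if [ "$res" = "23" ] && [ -n "$ignore23" ]; then
    res=0
  fi
  echo "$res"
}
```

On hosts with a recent rsync the second argument stays empty, so a genuine exit code 23 is still reported; on older hosts every cause of 23 is masked, which is the gap the route verification is meant to cover.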

# Remove --ignore-missing-args parameter if using an earlier version of rsync.
if ! rsync -h | grep -q '\-\-ignore-missing-args'; then
  for parameter; do
    [[ ! $parameter == "--ignore-missing-args" ]] && parameters+=("$parameter") || ignore23=1

@snh

snh Apr 17, 2018
Author Member

Still trying to find an alternative to this.

In all my attempts so far, converting the arguments to a string in order to remove --ignore-missing-args strips out the quotes, so things like --rsync-path='sudo -u git rsync' become --rsync-path=sudo -u git rsync.
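For illustration, a bash array keeps each argument as a single element, so nothing needs re-quoting. This is only a sketch; `strip_ignore_missing_args` is a hypothetical wrapper around the same loop, written with an explicit if/else.

```bash
#!/usr/bin/env bash
# Copy every argument except --ignore-missing-args into the parameters array
# and record in ignore23 whether the flag was present.
strip_ignore_missing_args() {
  parameters=()
  ignore23=
  local parameter
  for parameter in "$@"; do
    if [ "$parameter" = "--ignore-missing-args" ]; then
      ignore23=1
    else
      parameters+=("$parameter")
    fi
  done
}

strip_ignore_missing_args --rsync-path='sudo -u git rsync' --ignore-missing-args -a
printf '<%s>\n' "${parameters[@]}"
# <--rsync-path=sudo -u git rsync>
# <-a>
```

Expanding the result as `"${parameters[@]}"` hands rsync the value of --rsync-path as one word, spaces included.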

@lildude

lildude Apr 17, 2018
Member

For the life of me, I can't find where I've dealt with this situation in the past and can't think of a solution without resorting to array manipulation or getting into hacking brittle and confusing POSIX compliant solutions.

I think we can make an exception in this situation as we only want to remove an element from the $@ array.

That said, I'm not sure this is doing as it should be when run on hosts with support for this option:

$ rsync -h | grep '\-\-ignore-missing-args'
     --ignore-missing-args   ignore missing source args without error
$
$ bash -x ./share/github-backup-utils/ghe-rsync --rsync-path=\'sudo -u git rsync\' --ignore-missing-args
[...]
+ ignoreout='^(file has vanished: |rsync warning: some files vanished before they could be transferred)'
+ rsync -h
+ grep -q '\-\-ignore-missing-args'
+ for parameter in '"$@"'   <---- **I DONT THINK WE SHOULD BE HERE**
+ [[ ! --rsync-path='sudo == \-\-\i\g\n\o\r\e\-\m\i\s\s\i\n\g\-\a\r\g\s ]]
+ parameters+=("$parameter")
+ for parameter in '"$@"'
+ [[ ! -u == \-\-\i\g\n\o\r\e\-\m\i\s\s\i\n\g\-\a\r\g\s ]]
+ parameters+=("$parameter")
+ for parameter in '"$@"'
+ [[ ! git == \-\-\i\g\n\o\r\e\-\m\i\s\s\i\n\g\-\a\r\g\s ]]
+ parameters+=("$parameter")
+ for parameter in '"$@"'
+ [[ ! rsync' == \-\-\i\g\n\o\r\e\-\m\i\s\s\i\n\g\-\a\r\g\s ]]
+ parameters+=("$parameter")
+ for parameter in '"$@"'
+ [[ ! --ignore-missing-args == \-\-\i\g\n\o\r\e\-\m\i\s\s\i\n\g\-\a\r\g\s ]]
+ ignore23=1
+ egrep -v '^(file has vanished: |rsync warning: some files vanished before they could be transferred)'
+ rsync '--rsync-path='\''sudo' -u git 'rsync'\'''  <---- **THE ARG HAS GONE AWOL**
+ egrep -v '^(file has vanished: |rsync warning: some files vanished before they could be transferred)'
rsync: link_stat "/Users/lildude/github/backup-utils/git" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1209) [sender=3.1.3]
+ res=23
+ '[' 23 = 24 ']'
+ '[' 23 = 23 ']'
+ '[' -n 1 ']'
+ res=0
+ exit 0
$

@snh

snh Apr 18, 2018
Author Member

That said, I'm not sure this is doing as it should be when run on hosts with support for this option:

I'm struggling to reproduce this, but will keep trying.

$ rsync -h | grep '\-\-ignore-missing-args'
     --ignore-missing-args   ignore missing source args without error

$ bash -x ./share/github-backup-utils/ghe-rsync --rsync-path=\'sudo -u git rsync\' --ignore-missing-args
[...]
+ ignoreout='^(file has vanished: |rsync warning: some files vanished before they could be transferred)'
+ rsync -h
+ grep -q '\-\-ignore-missing-args'
+ parameters=("$@")
+ egrep -v '^(file has vanished: |rsync warning: some files vanished before they could be transferred)'
+ rsync '--rsync-path='\''sudo' -u git 'rsync'\''' --ignore-missing-args
[...]

# Remove --ignore-missing-args parameter if using an earlier version of rsync.
if ! rsync -h | grep -q '\-\-ignore-missing-args'; then
  for parameter; do

@lildude

lildude Apr 17, 2018
Member

Can you add the in "$@" part here too, to make it clear that this is what we're iterating over? It isn't obvious unless you know about this behaviour.
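As a quick illustration of the behaviour in question, the two loops below visit exactly the same words. The function names are made up for the example.

```bash
#!/usr/bin/env bash
# With no "in" list, for iterates over the positional parameters,
# exactly as if in "$@" had been written.
count_implicit() {
  local n=0 p
  for p; do
    n=$((n + 1))
  done
  echo "$n"
}

count_explicit() {
  local n=0 p
  for p in "$@"; do
    n=$((n + 1))
  done
  echo "$n"
}

count_implicit 'sudo -u git rsync' --ignore-missing-args   # prints 2
count_explicit 'sudo -u git rsync' --ignore-missing-args   # prints 2
```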

snh added 9 commits Apr 18, 2018
The negative matches weren't working as was for some unknown reason.
-q appears to cause issues in some circumstances when used in
combination with pipefail.
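One plausible explanation, offered as an assumption rather than a confirmed diagnosis: grep -q exits as soon as it finds a match, the command feeding it can then fail while writing to the closed pipe, and pipefail reports that failure as the status of the whole pipeline. The sketch below reproduces the effect with `yes` and shows a hypothetical helper, `output_contains`, that avoids the pipe altogether.

```bash
#!/usr/bin/env bash
set -o pipefail

# grep -q stops at the first match; yes is then killed while writing to the
# closed pipe, so the pipeline is reported as failed even though grep matched.
if yes 2>/dev/null | grep -q y; then
  echo "match"
else
  echo "pipeline failed despite the match"
fi

# Hypothetical helper: capture the command's output first, then search it,
# so the producer always finishes before grep runs.
# $1: fixed string to look for; remaining arguments: the command to run.
output_contains() {
  local needle=$1
  shift
  local output
  output=$("$@" 2>&1) || true
  grep -F -e "$needle" <<<"$output" >/dev/null
}

# Example call, if rsync is installed:
# output_contains '--ignore-missing-args' rsync -h
```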
@snh
Member Author

@snh snh commented Apr 18, 2018

I agree we need to allow the backup to continue though I fear we will be significantly increasing the perceived backup time with the additional verification step for all backups.

Some rudimentary testing with ~120,000 repositories indicated that the verification added around 5 minutes to the backup time (the majority of which was spent doing the find).

@lildude
Member

@lildude lildude commented Apr 24, 2018

Some rudimentary testing with ~120,000 repositories indicated that the verification added around 5 minutes to the backup time (the majority of which was spent doing the find).

What is this as a proportion of the overall backup time?

5 mins extra on a 15 min backup is significant. 5 mins extra on a 6 hour backup is barely noticeable.

@lildude
Member

@lildude lildude commented May 11, 2018

@snh nudge ☝️

snh added 2 commits May 11, 2018
@snh
Member Author

@snh snh commented May 11, 2018

@lildude Here are some timings from a different appliance, containing realistic end-user data:

$ find storage/ -mindepth 4 -maxdepth 4 -type f -print | sort | uniq | wc -l
129643
$ find repositories/ -mindepth 6 -maxdepth 7 -type d -name \*.git -exec dirname {} \; | sort | uniq | wc -l
14129

ghe-backup-repositories - Generating routes took 3s
ghe-backup-repositories - Fetching routes took 1s
ghe-backup-repositories - Processing routes took 0s
ghe-backup-repositories - Repo sync took 302s
ghe-backup-repositories - Special Data Directories Sync took 0s
ghe-backup-repositories - Verifying Routes took 15s
ghe-backup-repositories took 321s

ghe-backup-storage - Generating routes took 10s
ghe-backup-storage - Fetching routes took 3s
ghe-backup-storage - Processing routes took 1s
ghe-backup-storage - Storage object sync took 97s
ghe-backup-storage - Verifying Routes took 17s
ghe-backup-storage took 128s
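
Expressed as a share of each step's total time, using only the figures above (`percent_of_total` is a throwaway helper name for this calculation):

```bash
#!/usr/bin/env bash
# Print one value as a percentage of another, to one decimal place.
# $1: seconds spent verifying routes, $2: total seconds for the step
percent_of_total() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", part / total * 100 }'
}

percent_of_total 15 321   # ghe-backup-repositories: prints 4.7
percent_of_total 17 128   # ghe-backup-storage: prints 13.3
```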

@lildude
Member

@lildude lildude commented May 12, 2018

Overall impact on performance for much larger instances seems quite small so I think we're good on that front.

Think we should wrap the checks in an environment variable setting so it can be disabled easily if it does prove to have a big impact on other customers?
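Such a switch might look something like this. The variable name `GHE_SKIP_ROUTE_VERIFICATION` and the function `should_verify_routes` are assumptions made for the sketch, not names confirmed anywhere in this thread.

```bash
#!/usr/bin/env bash
# Hypothetical gate: verification runs unless the (assumed) variable
# GHE_SKIP_ROUTE_VERIFICATION is set to a non-empty value.
should_verify_routes() {
  [ -z "${GHE_SKIP_ROUTE_VERIFICATION:-}" ]
}

if should_verify_routes; then
  echo "Verifying Routes"
else
  echo "Skipping route verification"
fi
```

Defaulting to "verify" keeps the safety check on for everyone and leaves the opt-out to sites where the extra time proves to be a problem.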

@snh snh requested a review from lildude May 14, 2018
Member

@lildude lildude left a comment

LGTM. Thanks for tackling this. 🙇‍♂️

@snh snh merged commit 2527b47 into master May 14, 2018
2 checks passed
backup-utils Build #10535566 succeeded in 122s
continuous-integration/travis-ci/pr The Travis CI build passed
@snh snh deleted the snh/ignore-all-missing branch May 14, 2018
@terrorobe
Contributor

@terrorobe terrorobe commented May 14, 2018

Thinking out loud - do we have any good mechanism to signal these failures to the administrator, now that the backups are allowed to complete?
