User:GreenC/WaybackMedic 2


WaybackMedic is a bot that fixes problems with links to the Internet Archive Wayback Machine (and some links to WebCite and other archives). Most edits will be Fix #2 below.

The difference between WaybackMedic 1 & 2:

  • WM2 processes all articles containing Wayback links (about 380k). WM1 processed a subset of articles (about 140k).
  • WM2 will do fix #2 on all links. WM1 only did fix #2 if another fix was done on the same URL at the same time.

The bot operator is User:Green Cardamom. Please leave problem notices on my talk page.


WM fixes
WaybackMedic Fixes
Fix number Function name Example edit Description Notes
1 fixthespuriousone Example Remove spurious |1= in cite templates.
2 fixmissingprotocol Example 1. Add https if protocol missing from the archive.org URL.
2. Convert existing protocol http to https.
3. Add second-level domain web if missing (archive.org/web/ → web.archive.org/web/)
4. Add "/web" path (web.archive.org/2016/ → web.archive.org/web/2016/)
5. Remove ":80" (e.g. https://web.archive.org/web/2016/http://example.com:80/). Port 80 is added by the API and is not needed. Ports other than 80 are retained.
HTTPS per RFC
3 fixemptyarchive Example 1. If |archiveurl= is empty, remove |archiveurl= and |archivedate= and add {{dead link}}.
2. If |archiveurl= is empty but the |url= is working then leave alone.
3. If |archivedate= is empty or missing but |archiveurl= has content, generate value for |archivedate= based on snapshot date in the archive URL.
4 fixbadstatus Example Check all Wayback Machine URLs for response code errors (anything but 200s). If an error code is returned, try to find a better URL via the Wayback API, first using accessdate, then using the earliest date available. If none is found, remove |archiveurl= and |archivedate= and add {{dead link}}.
5 fixtrailingchar Example In some cases the URL erroneously ends with a trailing "." or "," or ":" or "-". Remove the trailing character.
6 fixemptywayback Example The wayback template is mangled in a certain way. Action: re-assemble. It won't delete multiple instances if they exist in the same ref (as in the Example).
7 fixencodedurl Example The URL was incorrectly encoded. Fully decode URL and re-encode.
8 fixdatemismatch Example 1. Ensure |archivedate= matches the snapshot date in the URL
2. Ensure date format matches dmy or mdy if set (retain ymd if in use)
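
The URL clean-up in fix #2 and the date derivation in fixes #3 and #8 can be illustrated with a short Python sketch. This is not the bot's code: the function names `normalize_wayback_url` and `snapshot_date` and the regular expressions are assumptions made for illustration, and the sketch only handles plain numeric snapshot timestamps.

```python
import re
from datetime import datetime

# Hypothetical pattern: optional protocol, optional "web." subdomain,
# optional "/web" path, then a numeric snapshot timestamp and the target URL.
_WAYBACK = re.compile(
    r"^(?:https?:)?(?://)?(?:web\.)?archive\.org(?:/web)?/(\d{4,14})/(.*)$",
    re.IGNORECASE,
)
# ":80" directly after the host of the archived URL; ":8080" etc. do not match.
_PORT_80 = re.compile(r"^((?:https?://)?[^/:]+):80(?=/|$)", re.IGNORECASE)


def normalize_wayback_url(url):
    """Return the URL in the form https://web.archive.org/web/<timestamp>/<target>.

    URLs that do not look like Wayback snapshot links are returned unchanged.
    """
    m = _WAYBACK.match(url.strip())
    if m is None:
        return url
    timestamp, target = m.groups()
    target = _PORT_80.sub(r"\1", target, count=1)
    return f"https://web.archive.org/web/{timestamp}/{target}"


def snapshot_date(url, style="ymd"):
    """Derive an archive date from the snapshot timestamp in a Wayback URL.

    style is "dmy", "mdy" or "ymd". Returns None if the URL has no
    timestamp with at least a full YYYYMMDD date.
    """
    m = _WAYBACK.match(url.strip())
    if m is None or len(m.group(1)) < 8:
        return None
    try:
        day = datetime.strptime(m.group(1)[:8], "%Y%m%d")
    except ValueError:
        return None
    if style == "dmy":
        return f"{day.day} {day:%B} {day.year}"
    if style == "mdy":
        return f"{day:%B} {day.day}, {day.year}"
    return day.strftime("%Y-%m-%d")


print(normalize_wayback_url("http://archive.org/web/2016/http://example.com:80/"))
print(snapshot_date("https://web.archive.org/web/20160820123456/http://example.com/", "dmy"))
```

A date-mismatch check along the lines of fix #8 would then compare the value of |archivedate= with the result of `snapshot_date` for the same citation.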
WM examines
  • {{wayback}} templates inside and outside ref pairs.
  • Citation templates inside and outside ref pairs.
  • Bare URLs outside templates. If these return a 404 or similar error, they are replaced with the regular URL.
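
As a rough illustration of the scanning step, the sketch below finds Wayback URLs in a piece of wikitext and reports whether each one sits inside a ref pair. The regular expressions and the name `find_wayback_links` are assumptions for this example only; the bot's actual parser is not described here and will handle many more cases (named refs containing slashes, nested templates and so on).

```python
import re

# Hypothetical pattern for a Wayback snapshot URL embedded in wikitext.
WAYBACK_URL = re.compile(
    r"(?:https?:)?//(?:web\.)?archive\.org/(?:web/)?\d{4,14}/[^\s|\]}<]+",
    re.IGNORECASE,
)
# Simplified <ref>...</ref> matcher; self-closing <ref ... /> tags are skipped.
REF_PAIR = re.compile(r"<ref[^>/]*>.*?</ref>", re.DOTALL | re.IGNORECASE)


def find_wayback_links(wikitext):
    """Return a list of (url, inside_ref) tuples in document order."""
    ref_spans = [m.span() for m in REF_PAIR.finditer(wikitext)]
    results = []
    for m in WAYBACK_URL.finditer(wikitext):
        inside = any(start <= m.start() < end for start, end in ref_spans)
        results.append((m.group(0), inside))
    return results


sample = (
    "Text <ref>{{cite web |url=http://example.com "
    "|archiveurl=https://web.archive.org/web/20160101000000/http://example.com/ "
    "|archivedate=1 January 2016}}</ref> and "
    "[http://archive.org/web/2015/http://example.org/page Title]."
)
for url, inside in find_wayback_links(sample):
    print(inside, url)
```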
WM design
  • Multiple HTTP header status code checks at application layer. Verifies Wayback URLs.
  • Timeouts and retries built into the web transfer agent settings (wget).
  • Multiple checks of the Wayback API using multiple dates to ensure a page really is unavailable.
  • Re-checks the API results by looking at the header to ensure it really is a good page.
  • If IA returns a 404 page reading "Bummer. The machine that serves this file is down.", treat it as a code 200 and leave the link alone.
  • If no Wayback available, checks Memento for alternative archives such as Library of Congress, WebCite and a few dozen others.
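
The checks above can be sketched in Python as follows. The bot itself uses wget, and its code is not shown here, so every function name in this block (`build_query`, `closest_snapshot`, `find_snapshot`, `is_dead`) is hypothetical. The sketch assumes the public Wayback availability endpoint at https://archive.org/wayback/available and its `archived_snapshots` / `closest` JSON layout, and it assumes that passing a very early timestamp is a reasonable way to ask for the earliest snapshot.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://archive.org/wayback/available"
BUMMER = "The machine that serves this file is down"


def build_query(url, timestamp=None):
    """Build an availability API query for a URL and an optional timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{API}?{urlencode(params)}"


def closest_snapshot(payload):
    """Return (url, timestamp) if the API reports an available 200 snapshot."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available") or closest.get("status") != "200":
        return None
    return closest["url"], closest["timestamp"]


def _fetch_json(query, timeout=30):
    """Fetch and decode one API response (network access required)."""
    with urlopen(query, timeout=timeout) as resp:
        return json.load(resp)


def find_snapshot(url, timestamps, fetch=_fetch_json):
    """Try each timestamp in order and return the first usable snapshot."""
    for ts in timestamps:
        hit = closest_snapshot(fetch(build_query(url, ts)))
        if hit:
            return hit
    return None


def is_dead(status, body=""):
    """Apply the status rules above: only 2xx, or a 404 "Bummer" page, is kept."""
    if 200 <= status < 300:
        return False
    if status == 404 and BUMMER in body:
        return False  # IA server outage, not a missing snapshot
    return True
```

A caller might use `find_snapshot(url, [accessdate, "19700101"])` to try the access date first and then the earliest capture. Because API results do not always match the real status code, any URL returned this way would still need its own header check (for example with `is_dead`) before being written into an article.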

Statistics

August 20, 2016

WaybackMedic 2 checked ~400k articles containing Wayback links on en.Wikipedia. This is the full corpus of every article containing Wayback links in the system as of August 20, 2016. It found about 1.1 million Wayback links. Of those, about 15,000 links were not working. WM2 fixed 3,786 by finding a new snapshot date and 680 by finding an alternative archive service; the remaining 11,125 were deleted from Wikipedia. Of the deleted links, about 7,000 were due to robots.txt; the rest were 301/302 (infinite loop), 303, 400, 401, 403 (non-robots.txt), 406, 409, 415, 500, 502, 504 and 521.

WaybackMedic 2 edited 203,902 articles in total. Most edits were fix #2 (439,170 changes), which mostly amounted to changing http to https and adding "/web/" to the URL. The run also uncovered a large number of "Log dead URL" and "Log emptyarch" cases (fix #3), as well as a very large number of "Log date mismatch" cases (fix #8).

WaybackMedic Stats
Type Number Description
Bummer 497 Wayback links that return "Bummer page not found"
API mismatch 21280 Wayback API returned fewer records than sent
Bogusapi 14625 Wayback API-returned links that don't match real status code
JSON mismatch 20575 Wayback API returned different size JSON
Discovered 203902 Number of articles edited by WaybackMedic
Log 404 16194 Dead wayback links
Log emptyarch 1065 Empty archiveurl arguments
Log emptyway 1 Ref has an empty {{wayback}}
Log encode 3 URL misencoded
Log spurious 1 131 Spurious |1= parameter
Log trail 15 URL has a trailing bad character
Log dead URL 2338 |url= is dead even though |deadurl=no, |archiveurl=dead and missing {{dead}}
Log skindeep 439170 Changes to URL are "skindeep" ie. format of URL changed to match https://web.archive.org/web/...
Log date mismatch 24473 Date in archive URL doesn't match archivedate argument in cite template
New alt archive 680 Replaced with archive URL found at Mementoweb.org
New IA date 3786 Changed snapshot date
Wayback RM 11125 Wayback link deleted
Wayback All 1090289 Wayback links total found

Links