56 Commits

Author SHA1 Message Date
Neal Norwitz
91bde2bf99 Backport:
SF bug 753592, websucker

Pass the proper variable when the user supplies a directory.
2003-07-01 04:17:25 +00:00
Fred Drake
7b22a1a2de Handle the Content-Type header a little more appropriately: if it
contains options, drop them to get the major/minor content type.
Modified from the supplied patch to support more whitespace variation.
Closes SF patch #613605.
2002-11-12 22:21:01 +00:00
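Note on the entry above: a minimal sketch of the idea, not the code from the patch. It drops any options after the first ";" in a Content-Type value and tolerates stray whitespace. The helper name base_content_type and the lowercasing step are assumptions made for illustration.

```python
def base_content_type(header_value):
    """Return the major/minor type from a Content-Type header value.

    Hypothetical helper: everything after the first ';' (the options,
    such as charset) is discarded, surrounding whitespace is stripped,
    and the result is lowercased (an assumption, for easy comparison).
    """
    return header_value.split(";", 1)[0].strip().lower()


plain = base_content_type("text/html")
with_options = base_content_type("text/html; charset=iso-8859-1")
odd_spacing = base_content_type("  Text/HTML ;charset=utf-8")
```

All three calls yield the same bare type, so a caller can compare against "text/html" without caring how the server formatted the header.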
Michael W. Hudson
7742c3d40d I presume most of the fixes currently hitting the tree should go into
2.2.1, but it would be nice if people remembered to comment on their
fixes' applicability!

backport akuchling's checkin of
    revision 1.26 of webchecker.py

[Bug #512799] urllib.splittype() returns a 2-tuple.  (Reported by seb bacon)
2002-03-11 10:04:07 +00:00
Guido van Rossum
f0953b9dff Fix SF bug #482171: webchecker dies on file: URLs w/o robots.txt
The cause seems to be that when a file URL doesn't exist,
urllib.urlopen() raises OSError instead of IOError.  Simply add this
to the except clause.  Not elegant, but effective. :-)
2001-12-11 22:41:24 +00:00
Fred Drake
a2133339ff Only catch NameError and TypeError when attempting to subclass an
exception (for compatibility with old versions of Python).
2001-05-11 19:40:10 +00:00
Fred Drake
d34a9c98a9 Added more link attributes based on additional information from Chris
McCafferty <christopher.mccafferty@csg.ch>, and a bit of experimentation
with Navigator 4.7.

HTML-as-deployed is evil!
2001-04-05 18:14:50 +00:00
Fred Drake
f3186e8242 A number of improvements based on a discussion with Chris McCafferty
<christopher.mccafferty@csg.ch>:

Add javascript: and telnet: to the types of URLs we ignore.

Add support for several additional URL-valued attributes on the BODY,
FRAME, IFRAME, LINK, OBJECT, and SCRIPT elements.
2001-04-04 17:47:25 +00:00
Guido van Rossum
f3335e193b Patch inspired by Just van Rossum: on the Mac, in savefilename(), make
the path to save a relative path by prefixing it with os.sep (':').
Also fix an indent inconsistency in the same function.
2000-04-25 21:13:24 +00:00
Guido van Rossum
918429b3b2 Moved robotparser.py to the Lib directory.
If you do a "cvs update" in the Lib directory, it will pop up there.
2000-03-29 16:02:45 +00:00
Guido van Rossum
84306246f1 Fix suggested by Magnus Kessler: in class Page, it is possible for
self.parser to be None; in that case don't dereference it in
getnames().
2000-03-28 20:10:39 +00:00
Guido van Rossum
dc8b7980e0 Skip Montanaro:
The robotparser.py module currently lives in Tools/webchecker.  In
preparation for its migration to Lib, I made the following changes:

    * renamed the test() function _test
    * corrected the URLs in _test() so they refer to actual documents
    * added an "if __name__ == '__main__'" catcher to invoke _test()
      when run as a main program
    * added doc strings for the two main methods, parse and can_fetch
    * replaced usage of regsub and regex with corresponding re code
2000-03-27 19:29:31 +00:00
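Note on the entry above: a short sketch of the two methods it names, parse and can_fetch. It assumes a current Python 3, where the module lives at urllib.robotparser rather than as a top-level robotparser.py. The rules text, the "webchecker" agent string, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; parse() takes an iterable of lines,
# so no network access is needed for this sketch.
RULES = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# can_fetch(useragent, url) reports whether the rules permit the fetch.
allowed = parser.can_fetch("webchecker", "http://example.com/index.html")
blocked = parser.can_fetch("webchecker", "http://example.com/private/page.html")
```

Feeding the rules in directly keeps the example offline; a real checker would normally call set_url() and read() to load robots.txt from the site being crawled.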
Guido van Rossum
4755ee567d Complete the integration of Sam Bayer's fixes.
1999-11-17 15:41:47 +00:00
Guido van Rossum
497a19879d Changed from importing wcnew back to webchecker.
1999-11-17 15:40:48 +00:00
Guido van Rossum
e284b21457 Integrated Sam Bayer's wcnew.py code. It seems silly to keep two
files.  Removed Sam's "SLB" change comments; otherwise this is the
same as wcnew.py.
1999-11-17 15:40:08 +00:00
Guido van Rossum
61b95db389 # *NOT* by Sam Bayer: reindented to use 4 spaces like the rest here,
# and removed trailing whitespace.
1999-11-17 15:13:21 +00:00
Guido van Rossum
64acb5ce93 Samuel L. Bayer:
- same trick with "import wcnew; webchecker = wcnew" as above
- updated readhtml() method to handle pair representation; used
  new name suppression infrastructure from wcnew.py to suppress
  processing name anchors

[And untabified --GvR]
1999-11-17 15:04:26 +00:00
Guido van Rossum
a8946406df Samuel L. Bayer:
- added -t and -a arguments
- added "import wcnew; webchecker = wcnew" in place of "import
  webchecker" (I assume that if you're happy with the changes, you'll
  just replace webchecker.py with wcnew.py, but if I were to do that,
  the diffs would be incomprehensible)
- fixed buggy -v argument (I think you got out of sync with the
  way verbosity was handled in webchecker vs. wcgui between 1.5 and
  1.5.2)
- made -v actually do something by adding a call to c.setflags()
  (probably the same problem as above)
- updated references to URLs to accommodate wcnew.py's pair
  representation; added appropriate calls to format_url() to handle
  display; added argument to ListPanel() initialization to provide
  access to format_url()

[And untabified --GvR]
1999-11-17 15:03:52 +00:00
Guido van Rossum
f97eecccb7 Samuel L. Bayer:
- same fixes from webchecker.py
- incorporated small diff between current webchecker.py and 1.5.2
- fixed bug where "extra roots" added with the -t argument were being
  checked as real roots, not just as possible continuations
- added -a argument to suppress checking of name anchors

[And untabified --GvR]
1999-11-17 15:02:53 +00:00
Guido van Rossum
dbd5c3e63b Samuel L. Bayer:
- forced new done origins to set errors if they're in self.bad (fixes
  bug where only the first of a number of errorful references to a
  link is reported under some circumstances)
- suppressed adding duplicates to self.todo list (cleans up printout
  in wcgui details)
1999-11-17 15:00:14 +00:00
Guido van Rossum
0ec1493d0b Some changes (maybe not enough?) to make it work on Windows with local
file URLs.
1999-04-26 23:11:46 +00:00
Guido van Rossum
545006259d Added Samuel Bayer's new webchecker.
Unfortunately his code breaks wcgui.py in a way that's not easy
to fix.  I expect that this is a temporary situation --
eventually Sam's changes will be merged back in.
(The changes add a -t option to specify exceptions to the -x
option, and explicit checking for #foo style fragment ids.)
1999-03-24 19:09:00 +00:00
Guido van Rossum
909bc18188 Recover from failed saves; when a file turns out to be a directory,
create a directory and move the original file to the index.html.
1999-01-03 13:06:00 +00:00
Guido van Rossum
a42c1ee21d Added note() message to Page class -- this was used but didn't exist.
(The alternative would be to call self.checker.note() but since
self.checker might be None that's not quite right.)
1998-08-06 21:31:13 +00:00
Guido van Rossum
b77a68e6b1 Rewrite to support multiple suckers, each with their own thread.
1998-07-08 03:05:22 +00:00
Guido van Rossum
125700addb Instead of printing, use self.message() or self.note().
1998-07-08 03:04:39 +00:00