61 Commits

Author SHA1 Message Date
Tim Peters
182b5aca27 Whitespace normalization, via reindent.py. 2004-07-18 06:16:08 +00:00
Andrew M. Kuchling
a982c44543 [Patch #918212] Support XHTML's 'id' attribute, which can be on any element. 2004-03-21 19:07:23 +00:00
Neal Norwitz
592c4cc460 SF bug 753592, websucker bug
Pass the proper variable when the user supplies a directory.
Will backport.
2003-07-01 04:14:28 +00:00
Mark Hammond
ce56c377a0 When bad HTML is encountered, ignore the page rather than failing with
a traceback.
2003-02-27 06:59:10 +00:00
Fred Drake
0b9e3f750c Handle the Content-Type header a little more appropriately: if it
contains options, drop them to get the major/minor content type.
Modified from the supplied patch to support more whitespace variation.
Closes SF patch #613605.
2002-11-12 22:19:34 +00:00
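A minimal sketch of the idea in the commit above, not the code that was actually committed: options after the first ';' are dropped and surrounding whitespace is stripped, leaving the major/minor content type. The function name `base_content_type` and the lowercase normalization are assumptions made for illustration.

```python
def base_content_type(header_value):
    """Return the bare 'major/minor' type from a Content-Type header value.

    Hypothetical helper for illustration; not taken from webchecker itself.
    """
    # Everything after the first ';' is an option such as charset=...
    ctype = header_value.split(";", 1)[0]
    # Tolerate whitespace around the type. Lowercasing is an assumption here,
    # made because MIME type names are case-insensitive.
    return ctype.strip().lower()


if __name__ == "__main__":
    print(base_content_type("text/html; charset=iso-8859-1"))
```

A caller would then compare the result against 'text/html' to decide whether a page should be parsed for links.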
Walter Dörwald
aaab30e00c Apply diff2.txt from SF patch http://www.python.org/sf/572113
(with one small bugfix in bgen/bgen/scantools.py)

This replaces string module functions with string methods
for the stuff in the Tools directory. Several uses of
string.letters etc. are still remaining.
2002-09-11 20:36:02 +00:00
Walter Dörwald
88a20baa77 Apply diff.txt from SF patch http://www.python.org/sf/561478
This uses cgi.parse_header() in Checker.checkforhtml(), so that
webchecker recognises the mime type text/html even if options
are specified.
2002-06-06 17:01:21 +00:00
Andrew M. Kuchling
566c0c737f [Bug #512799] urllib.splittype() returns a 2-tuple. (Reported by seb bacon) 2002-03-08 17:19:10 +00:00
Guido van Rossum
f0953b9dff Fix SF bug #482171: webchecker dies on file: URLs w/o robots.txt
The cause seems to be that when a file URL doesn't exist,
urllib.urlopen() raises OSError instead of IOError.  Simply add this
to the except clause.  Not elegant, but effective. :-)
2001-12-11 22:41:24 +00:00
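A rough sketch of the fix described above, written for Python 3 rather than the Python of the time: both exception types are named in the except clause so that a missing file: URL is treated as "no robots.txt" instead of a crash. The helper name `try_open` is hypothetical, and in Python 3 IOError is simply an alias of OSError, so listing both only mirrors the original intent.

```python
import urllib.request


def try_open(url):
    """Open a URL, returning None if it cannot be opened.

    Hypothetical helper for illustration; not taken from webchecker itself.
    """
    try:
        return urllib.request.urlopen(url)
    except (IOError, OSError):
        # A nonexistent file: URL ends up here (urllib.error.URLError is a
        # subclass of OSError), so the caller can carry on without robots.txt.
        return None
```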
Fred Drake
a2133339ff Only catch NameError and TypeError when attempting to subclass an
exception (for compatibility with old versions of Python).
2001-05-11 19:40:10 +00:00
Fred Drake
d34a9c98a9 Added more link attributes based on additional information from Chris
McCafferty <christopher.mccafferty@csg.ch>, and a bit of experimentation
with Navigator 4.7.

HTML-as-deployed is evil!
2001-04-05 18:14:50 +00:00
Fred Drake
f3186e8242 A number of improvements based on a discussion with Chris McCafferty
<christopher.mccafferty@csg.ch>:

Add javascript: and telnet: to the types of URLs we ignore.

Add support for several additional URL-valued attributes on the BODY,
FRAME, IFRAME, LINK, OBJECT, and SCRIPT elements.
2001-04-04 17:47:25 +00:00
Guido van Rossum
f3335e193b Patch inspired by Just van Rossum: on the Mac, in savefilename(), make
the path to save a relative path by prefixing it with os.sep (':').
Also fix an indent inconsistency in the same function.
2000-04-25 21:13:24 +00:00
Guido van Rossum
918429b3b2 Moved robotparser.py to the Lib directory.
If you do a "cvs update" in the Lib directory, it will pop up there.
2000-03-29 16:02:45 +00:00
Guido van Rossum
84306246f1 Fix suggested by Magnus Kessler: in class Page, it is possible for
self.parser to be None; in that case don't dereference it in
getnames().
2000-03-28 20:10:39 +00:00
Guido van Rossum
dc8b7980e0 Skip Montanaro:
The robotparser.py module currently lives in Tools/webchecker.  In
preparation for its migration to Lib, I made the following changes:

    * renamed the test() function _test
    * corrected the URLs in _test() so they refer to actual documents
    * added an "if __name__ == '__main__'" catcher to invoke _test()
      when run as a main program
    * added doc strings for the two main methods, parse and can_fetch
    * replaced usage of regsub and regex with corresponding re code
2000-03-27 19:29:31 +00:00
Guido van Rossum
4755ee567d Complete the integration of Sam Bayer's fixes. 1999-11-17 15:41:47 +00:00
Guido van Rossum
497a19879d Changed from importing wcnew back to webchecker. 1999-11-17 15:40:48 +00:00
Guido van Rossum
e284b21457 Integrated Sam Bayer's wcnew.py code. It seems silly to keep two
files.  Removed Sam's "SLB" change comments; otherwise this is the
same as wcnew.py.
1999-11-17 15:40:08 +00:00
Guido van Rossum
61b95db389 # *NOT* by Sam Bayer: reindented to use 4 spaces like the rest here,
# and removed trailing whitespace.
1999-11-17 15:13:21 +00:00
Guido van Rossum
64acb5ce93 Samuel L. Bayer:
- same trick with "import wcnew; webchecker = wcnew" as above
- updated readhtml() method to handle pair representation; used
  new name suppression infrastructure from wcnew.py to suppress
  processing name anchors

[And untabified --GvR]
1999-11-17 15:04:26 +00:00
Guido van Rossum
a8946406df Samuel L. Bayer:
- added -t and -a arguments
- added "import wcnew; webchecker = wcnew" in place of "import
  webchecker" (I assume that if you're happy with the changes, you'll
  just replace webchecker.py with wcnew.py, but if I were to do that,
  the diffs would be incomprehensible)
- fixed buggy -v argument (I think you got out of sync with the
  way verbosity was handled in webchecker vs. wcgui between 1.5 and
  1.5.2)
- made -v actually do something by adding a call to c.setflags()
  (probably the same problem as above)
- updated references to URLs to accommodate wcnew.py's pair
  representation; added appropriate calls to format_url() to handle
  display; added argument to ListPanel() initialization to provide
  access to format_url()

[And untabified --GvR]
1999-11-17 15:03:52 +00:00
Guido van Rossum
f97eecccb7 Samuel L. Bayer:
- same fixes from webchecker.py
- incorporated small diff between current webchecker.py and 1.5.2
- fixed bug where "extra roots" added with the -t argument were being
  checked as real roots, not just as possible continuations
- added -a argument to suppress checking of name anchors

[And untabified --GvR]
1999-11-17 15:02:53 +00:00
Guido van Rossum
dbd5c3e63b Samuel L. Bayer:
- forced new done origins to set errors if they're in self.bad (fixes
  bug where only the first of a number of errorful references to a
  link is reported under some circumstances)
- suppressed adding duplicates to self.todo list (cleans up printout
  in wcgui details)
1999-11-17 15:00:14 +00:00
Guido van Rossum
0ec1493d0b Some changes (maybe not enough?) to make it work on Windows with local
file URLs.
1999-04-26 23:11:46 +00:00