Google Sheds Light On 'Dark Web' With PDF Search

Posted by Soulskill on Fri Oct 31, 2008 04:40 PM
from the are-you-afraid-of-the-dark-web dept.
CWmike writes "Google this week took another step in its effort to shed light on the so-called Dark Web, announcing that its search engine can now search scanned documents in a PDF. In April, Google announced that it was looking for ways for its search engine to index HTML forms such as drop-down boxes or select menus that otherwise couldn't be found or indexed." An announcement is available at the official Google blog, and it contains some demonstration searches.
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Increasing the number of items that can be searched is great, but the actual searching algorithms really haven't gotten THAT much better in the past 3 years or so.

    Obviously, you can't have breakthroughs every year (or maybe even every 5 years) but search as an algorithm still has much more room to improve. I'd love to see an improvement in that, as opposed to just increasing the number of pages indexed.

    Still cool though...
    • I'm still waiting for a context modifier for keywords, so when you type something like 'mechanics:teeth' you get all the technical matches for gears, and when you type 'medicine:teeth' you would get all the medical matches for dentistry.

      • by Firehed (942385) on Friday October 31 2008, @11:00PM (#25592271) Homepage

        Why not just search for "teeth medicine" then? Google hasn't done direct keyword matching only in years now (for example, a search for "computer" may yield results containing synonyms such as "PC" or "Mac" even if the original keyword of "computer" isn't contained at all on the site).

        Remember that Yahoo started out as a category browser in its very early days, and now categories are really just another keyword. Google and all of the other search engines are designed to work well for the lowest common denominator of internet users - as someone with a 3-digit UID, I imagine you're not in that group. Trying to outsmart Google will probably just make its algorithm feel unnatural/broken.

      • Have you had a look at exalead.com ? It makes good strides in this direction (even if it fails your mechanics teeth context modifier).
      • Just use http://www.clusty.com/ [clusty.com] . The search results are just as good as google, and it generates a list of categories that you can select from.

        Admittedly, "mechanical" isn't in there... The categories are quite a bit more specific, such as "baby", "shark", "wisdom", "cleaner", etc.

      • I'd settle for being able to do any kind of special character search using google, or any search engine, for that matter. When trying to look up programming related content, the lack of ability to search by special characters can be a real pain.
    • When are they going to add Gmail contents to their search results?

  • by Anonymous Coward

    Referenced article is talking about the "deep web", not dark web.

    • Never heard of either before. Looks like there's a competition going on to see who comes up with the next buzzword.

      • The Deep, Deep, Dark, Dark, Deep, Dark Web...coming soon to a web browser near you!
      • Deep web is information buried under layers that are not easily penetrable by current indexing tech.

        Dark web can either be physically separate from the internet or a virtual network that is hidden through encryption, secrecy, or both.

  • It's DEEP web, not dark. This is the internet, not astrophysics.
  • Not so new? (Score:5, Interesting)

    by Archon-X (264195) on Friday October 31 2008, @06:04PM (#25590309)

    Google has long since favoured PDFs - and gives them boosted results, under the guise that anyone who makes a PDF has something serious to say, I guess.

    You may have noticed of late that people are wise to this - there are a bunch of sites that are embedding popular search terms / results in PDF files, and clustering their sites with adverts.

    • I've noticed this. Lots of the top 5 search results for the items I usually search are PDFs. I just figured that publishing a PDF is something large organizations usually do, and large organizations tend to have a higher PageRank. I'm not sure if PDF is in itself something that can raise a score.

    • Tesseract (Score:5, Interesting)

      by mcrbids (148650) on Friday October 31 2008, @11:21PM (#25592365) Journal

      Not so sure about PDFs as an image format - which is exactly what you have when you use PDF to hold scanned documents. I think the more interesting point is that they feel they have an OCR package good enough to be trustworthy. I wonder if it's based on the Tesseract OCR software [blogspot.com] that they adopted a while back?

      I played with it for a while, and got very poor results from the command line. Even when I made a png or bmp of a full screen single word "HELLO" in 200 pixel font with GIMP (about as perfect as input gets!) I'd often get "HEHO" or "H3H0" or god only knows what else.

      Of course, this was when the project relaunch was first announced a year or two ago; I certainly hope it's better now! Looking at their web page, it does appear that there's some significant activity [google.com] going on. Yay Google!

      Maybe I'll try it again, and see if it's worth using yet?

  • What I would really like to see is OCR for mathematical formulas, and store those in some standard format. Using a standard input, like LaTeX, the engine would search for mathematical equations. Right now I find it a pain to look for a formula that I know exists, but don't know its name.

    This would help bring together a lot of research that is done, but hard to sort through. Then, implement a smart system using a program like Mathematica to find variations of the equations, etc., and see where duplicat
    • That won't solve the problem you're having, since mathematical formulas contain arbitrary variable names.

      So if you're looking for the pythagorean formula z^2 = x^2 + y^2 say, (ie you've forgotten the name), then you'll miss the documents which contain c^2 = a^2 + b^2, etc.

      And that's the easy case, because a lot of people write z^2 = x^2 + y^2. What if for some reason your natural inclination is to type u^2 + f^2 = K^2? You'd be missing out on virtually every relevant link, because most mathematicians li

      • Actually, if the problem of recognizing a formula in text or graphics has been solved, the second part, graphing the formula, normalizing the graph and storing/retrieving graphs that meet certain criteria is quite simple.

        In other words, getting from the graphics to z^2 = x^2 + y^2 is the tough part. Once you're there, understanding that z^2 = x^2 + y^2 is equivalent to a^2 + b^2 = c^2 is easy.
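The renaming step claimed to be easy above can be sketched in Python, under heavy assumptions: the formula is already plain text, every single letter is a variable, and each side of `=` is a plain sum of terms. `canonical_form` is a hypothetical helper for illustration only. It handles nothing beyond renaming and reordering, and it says nothing about what the symbols mean.

```python
import re

# Assumption: every single ASCII letter is a variable name.
VAR = re.compile(r"[A-Za-z]")


def canonical_form(formula: str) -> str:
    """Rename variables by first appearance after sorting terms by shape."""
    sides = []
    for side in formula.split("="):
        terms = [t.replace(" ", "") for t in side.split("+")]
        # A term's "shape" is the term with every variable blanked out.
        terms.sort(key=lambda t: VAR.sub("_", t))
        sides.append(terms)
    # Put the shorter side first so "a + b = c" and "c = a + b" line up.
    sides.sort(key=lambda ts: (len(ts), [VAR.sub("_", t) for t in ts]))

    names = {}

    def rename(match):
        # First variable seen becomes v0, the next v1, and so on.
        return names.setdefault(match.group(0), f"v{len(names)}")

    return " = ".join(
        " + ".join(VAR.sub(rename, t) for t in ts) for ts in sides
    )
```

With this sketch, `z^2 = x^2 + y^2`, `a^2 + b^2 = c^2` and `u^2 + f^2 = K^2` all reduce to the same string, so a lookup on the canonical string would match all three spellings.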

        • Once you're there, understanding that z^2 = x^2 + y^2 is equivalent to a^2 + b^2 = c^2 is easy.

          What I'm saying is that's the tough part, whereas the OCR is comparatively easy (eg InftyReader [inftyproject.org]).

          You can only transform an equation if you know its meaning (ie the rules of transformation embodied by the context in which it is being written). And understanding the meaning is a hard AI problem.

          • And understanding meaning is a hard AI problem

            Not within a restricted knowledge domain. Mathematics, engineering, physics, etc. are some excellent examples of such domains.

            Been there, done that. Back when the Internet was still text based.

            • Mathematical formulas on their own are not a restricted domain. The reason is that the symbols and operations are overloaded to such an extent that you cannot, by looking only at a formula, know what it means. The surrounding context is completely necessary.

              For example, the "pythagorean" formula discussed above, z^2 = x^2 + y^2, doesn't carry any restrictions that tell you something. Are the variables numbers? In what kind of range? Are they matrices? Are they operators? Are the "2"s indices (labels) or e

              • The reason is that the symbols and operations are overloaded to such an extent that you cannot, by looking only at a formula, know what it means.

                So, how do humans read and "understand" such a formula, sitting by itself, with no surrounding context? Answer: They don't. The same holds true for machines. The following equation: z^2 = x^2 + y^2 only makes sense if the terms and notation are defined for the context, most likely in the surrounding text. Likewise, typing in the search term: c^2 = a^2 + b^2 doesn't give either a human or a machine enough to go on. In either case, there are two approaches. One, prompt the user for further constraints. Or two

                • In either case, there are two approaches. One, prompt the user for further constraints. Or two, the 'Google' response, which is to list every possible solution.

                  Precisely, and both known approaches are imho useless for the purpose of the OP, which is to type in an equation and obtain relevant documents in the case that he doesn't remember the context or wants variations.

                  The 'google' type response for formulas has high recall and very low precision. In fact, it effectively exists already for code sea

        • (See also my other comments on this thread). You're touching on issues which are problems of mathematical logic [wikipedia.org].

          If you think of say z^2 = x^2 + y^2 as an expression that belongs to a formal grammar of arithmetic (computer languages have formal grammars too), then you could use the rules of the grammar to check when two formulas are equivalent, and you could store in your database a canonical form for each such formula. This sort of thing was actually proposed by the great mathematician Hilbert a hundred y

  • Nice feature, but I think it only works with PDF? I would love to see the same with DjVu as well.
  • How about adding the word *scanned* into the headline, just as the original headline was.

    That way others won't have to read the summary going "Hey, I thought Google was searching PDFs for the last 10 years."

  • It'll be even cooler when Google are able to automatically detect things like citations and references, and add hyperlinks as appropriate.

    It still sort of bugs me that scientific papers are written in LaTeX, and not hypertext, especially considering that the web (in its current form) originated at CERN.

  • There's a module in CPAN [cpan.org] for this. It rips out the images and runs them through Tesseract. [sourceforge.net] It's worked well the few times I've tried it. Certainly well enough for search engine indexing.

    Also, my understanding of the "dark web" concept was that it referred to sites that had no links going to them, so no spiders are able to access them. I'm not seeing how any of this would fix the "problem".

    The only news here is that Google doesn't already index form content in drop down boxes and selection menus. S

    • by denmarkw00t (892627) <megsuma@@@gmail...com> on Friday October 31 2008, @04:58PM (#25589719) Homepage Journal
      I think you've got this wrong, to some extent. I don't think it's going to "submit" to see what options go where, but more just indexing the options from forms to give a better idea of what's going on in the page - suddenly Google can go "Hey, this isn't just a form, but it's a form pertaining to X." and thus make their results more relevant by being able to index more of a site as a whole.
      • You're mistaken.

        "For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes and radio buttons on the form, we choose from among the values of the HTML," they noted in a blog post. "Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the Web page resulting from our query is valid, interesting and includes content not in our index, we may inc
        • I suspect they'd only submit a form if the method is "get" rather than "post"... which technically is okay, although in practice it will likely upset some websites!
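The behaviour quoted from the blog post, combined with the GET-only guess above, could look roughly like this in Python. This is a sketch under assumptions, not Google's crawler: `candidate_urls`, its parameters and the `limit` cap are all hypothetical, and the example URL is a placeholder.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_urls(action, method, fields, limit=10):
    """Build GET URLs for combinations of form values.

    `fields` maps each input name to the list of values to try.
    Forms that do not use GET are skipped, since submitting a POST
    form may perform an action on the site.
    """
    if method.lower() != "get":
        return []
    names = sorted(fields)
    urls = []
    for combo in product(*(fields[name] for name in names)):
        query = urlencode(dict(zip(names, combo)))
        urls.append(action + "?" + query)
        if len(urls) >= limit:  # hypothetical cap on requests per form
            break
    return urls
```

A crawler built this way would still have to fetch each URL and decide whether the resulting page is worth indexing, which is the part the blog post describes only in general terms.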
    • by Reckless Visionary (323969) * on Friday October 31 2008, @04:58PM (#25589721)

      If people want their sites to be indexed, they shouldn't use forms for navigation. It's not rocket science.

      This isn't about people who want their sites indexed. It's about sites that Google wants to index, but which aren't designed to be indexed. If you prefer not to be indexed, Google says they will abide by robots.txt.
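For site owners who would rather opt out, a robots.txt rule can be checked with Python's standard `urllib.robotparser`. A minimal sketch: the `allowed` helper, the `ExampleBot` user agent and the example.com URLs are made up for illustration, and whether any particular crawler honours a rule is up to that crawler.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that keeps crawlers away from form results.
ROBOTS = """\
User-agent: *
Disallow: /search
"""

print(allowed(ROBOTS, "ExampleBot", "http://example.com/search?q=widget"))
print(allowed(ROBOTS, "ExampleBot", "http://example.com/about"))
```

The first check should be refused and the second permitted, since only paths starting with `/search` are disallowed.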

    • by spitzak (4019) on Friday October 31 2008, @05:03PM (#25589783) Homepage

      I think it is just going to look in the contents of the controls. This would be really useful, for instance if you search for "Widget Model XJ123" it will now find a page by a manufacturer where the only place they list it is in a pulldown list that lets you choose the product to buy.
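Pulling the visible text out of pulldown lists is simple enough to sketch with Python's standard `html.parser`. The class and function names here are made up, and the sketch assumes each `<option>` has a closing tag; it is not a claim about how Google does it.

```python
from html.parser import HTMLParser


class OptionText(HTMLParser):
    """Collect the visible label of every <option> element."""

    def __init__(self):
        super().__init__()
        self.in_option = False
        self.options = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self.in_option = True
            self.options.append("")

    def handle_endtag(self, tag):
        if tag == "option":
            self.in_option = False

    def handle_data(self, data):
        # Only text between <option> and </option> is kept.
        if self.in_option:
            self.options[-1] += data


def option_labels(html: str) -> list:
    """Return the non-empty option labels found in an HTML fragment."""
    parser = OptionText()
    parser.feed(html)
    parser.close()
    return [label.strip() for label in parser.options if label.strip()]
```

Feeding the returned labels into the page's indexed text is what would let a search for a product name find a page that only lists it in a pulldown.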

      • by Arthur Grumbine (1086397) on Friday October 31 2008, @05:50PM (#25590191) Homepage Journal

        for instance if you search for "Widget Model XJ123" it will now find a page by a manufacturer where the only place they list it is in a pulldown list that lets you choose the product to buy.

        Shenanigans! [google.com] And I've been looking everywhere for that elusive XJ123, since the manufacturer stopped producing it. How dare you get my hopes up!

        • by AVryhof (142320) <avryhofNO@SPAMgawab.com> on Saturday November 01 2008, @06:47AM (#25593915) Homepage

          for instance if you search for "Widget Model XJ123" it will now find a page by a manufacturer where the only place they list it is in a pulldown list that lets you choose the product to buy.

          Shenanigans! [google.com] And I've been looking everywhere for that elusive XJ123, since the manufacturer stopped producing it. How dare you get my hopes up you insensitive clod!

          There. Fixed that for you.

          • Is it funny or weird or normal that the only hit on Google for "Widget Model XJ123" is this thread?
      • What if I don't want to buy a widget? I'd like to see a Google filter which hides all the product pages from its listing. As I see it, those kinds of pages are just spam. Who wants to buy the same product from a zillion different places all over the web?

        It might actually be useful for a search engine to read the product name in a pulldown as a simple indicator that the page should be penalized as content free. I would probably pay to use that kind of search engine.

    • Re: (Score:3, Interesting)

      Well, if it's a form with a GET request then it should be safe to request it, and it's used merely to display some information. Forms using the POST method, which performs an action, are less safe and I'd hope Google is not trying to spider those.

      If people want their sites to be indexed, they shouldn't use forms for navigation.

      So the alternative is automatically generating pages and pages of links to every possible item in the database just so that search engines can follow them? If a form is the most nat

      • That's under the completely unsafe assumption that forms are being used properly. There have been numerous instances of people putting full SQL queries (with DB connection data) in a GET form - see TheDailyWTF.

        Though I suppose that's a bad example, as it would be really damn easy for Google to index THOSE sites. Just swap in a SELECT * and you're all set :)

    • It looks like they will only use GET requests, not POST requests. You may have trouble if you use GET requests to make changes on your site (which nearly everybody with minimal experience knows you should never do).

    • by fiannaFailMan (702447) on Friday October 31 2008, @05:20PM (#25589931) Journal

      "Scanning is the reverse of printing." -- WTF?! Because of artifacts?

      And isn't this what View as HTML has ALWAYS been about?

      Points awarded for techtard clarity, but the person at Google who thought writing a press release aimed at techtards should be firmly smacked.

      Calm down please. The guy is trying to explain the concept to a broader audience, or 'techtards' as you so pompously refer to them along with your out-of-context quote, and he's doing a fine job of explaining how it is hard for a computer to interpret scanned text. The days are gone when the web was the preserve of nerds with zero social skills. Get over it.

    • by PotatoFarmer (1250696) on Friday October 31 2008, @05:27PM (#25589983)
      I'm not sure if you got the point of this - it's about using a form of OCR to translate embedded document images within a PDF, rather than simply sucking the text out of the PDF itself, as you rightly point out is already available in the View as HTML option for PDF search results.

      Scanning is the reverse of printing because, well, it's the reverse of printing. When you're scanning something, you're taking a purely human-readable document and translating its contents into a machine interpretable form. This is pretty much the exact opposite of printing from a computer.
    • Actually I switched back to Yahoo search from Google search and find it's become pretty damned good. Especially the little "more" tab, which when pulled on, say "Dead Space", it'll give me reviews, codes, walkthroughs, etc. Compare that to the "more" in Google which gives me crap like Google blogs. If you haven't tried it lately they have really gotten a lot better. I guess what they really needed was the fear of MSFT put in them.
    • I thought you were being cynical, but then I found http://www.fart-joke.com/ [fart-joke.com] . Ah, well, all good things must come to an end.

      How ironic that the uselessness of the web as a serious communications tool should be discussed on the web.