
Google Sheds Light On 'Dark Web' With PDF Search

CWmike writes "Google this week took another step in its effort to shed light on the so-called Dark Web, announcing that its search engine can now search scanned documents in a PDF. In April, Google announced that it was looking for ways for its search engine to index HTML forms such as drop-down boxes or select menus that otherwise couldn't be found or indexed." An announcement is available at the official Google blog, and it contains some demonstration searches.
This discussion has been archived. No new comments can be posted.


  • Increasing the number of items that can be searched is great, but the actual searching algorithms really haven't gotten THAT much better in the past 3 years or so.

    Obviously, you can't have breakthroughs every year (or maybe even every 5 years) but search as an algorithm still has much more room to improve. I'd love to see an improvement in that, as opposed to just increasing the number of pages indexed.

    Still cool though...
    • by mikael ( 484 )

      I'm still waiting for a context modifier for keywords, so when you type something like 'mechanics:teeth' you get all the technical matches for gears, and when you type 'medicine:teeth' you would get all the medical matches for dentistry.

      • by Firehed ( 942385 ) on Friday October 31, 2008 @11:00PM (#25592271) Homepage

        Why not just search for "teeth medicine" then? Google hasn't relied on direct keyword matching alone in years (for example, a search for "computer" may yield results containing synonyms such as "PC" or "Mac" even if the original keyword "computer" doesn't appear on the site at all).

        Remember that Yahoo started out as a category browser in its very early days, and now categories are really just another keyword. Google and all of the other search engines are designed to work well for the lowest common denominator of internet users - as someone with a 3-digit UID, I imagine you're not in that group. Trying to outsmart Google will probably just make its algorithm feel unnatural/broken.

      • Have you had a look at exalead.com ? It makes good strides in this direction (even if it fails your mechanics teeth context modifier).
      • Just use http://www.clusty.com/ [clusty.com] . The search results are just as good as google, and it generates a list of categories that you can select from.

        Admittedly, "mechanical" isn't in there... The categories are quite a bit more specific, such as "baby", "shark", "wisdom", "cleaner", etc.

      • I'd settle for being able to do any kind of special character search using google, or any search engine, for that matter. When trying to look up programming related content, the lack of ability to search by special characters can be a real pain.
    • by Dan541 ( 1032000 )

      When are they going to add Gmail contents to their search results?

  • by Anonymous Coward

    Referenced article is talking about the "deep web", not dark web.

    • Never heard of either before. Looks like there's a competition going on to see who comes up with the next buzzword.

      • The Deep, Deep, Dark, Dark, Deep, Dark Web...coming soon to a web browser near you!
      • Deep web is information buried under layers that are not easily penetrable by current indexing tech.

        Dark web can either be physically separate from the internet or a virtual network that is hidden through encryption, secrecy, or both.

      • "Baby shark wisdom cleaner" 2.0 ?
  • It's DEEP web, not dark. This is the internet, not astrophysics.
  • I just started reading and it says "powerful search engines such as Google and Yahoo". Yahoo is a search engine? A Powerful one? It's an advertising index, Spam search, Ad finder? I call BS, no one thinks Yahoo is a powerful search engine!
  • Every time I use image search and see most are not related, I look at Google asking ME to help them label pictures to help. I feel guilty for not helping, and comfort myself knowing Google has a far better shot at image recognition than I ever will.
  • Not so new? (Score:5, Interesting)

    by Archon-X ( 264195 ) on Friday October 31, 2008 @06:04PM (#25590309)

    Google has long since favoured PDFs - and gives them boosted results, under the guise that anyone who makes a PDF has something serious to say, I guess.

    You may have noticed of late that people are wise to this - there are a bunch of sites that are embedding popular search terms / results in PDF files, and cluttering their sites with adverts.

    • I've noticed this. Lots of the top 5 search results for the items I usually search are PDFs. I just figured that publishing a PDF is something large organizations usually do, and large organizations tend to have a higher pagerank. I'm not sure if PDF is in itself something that can raise a score.

    • The new part is now Google can index PDFs that have no text, only embedded images, via OCR. These are pretty common as a way of posting scanned multi-page documents online; for example many older academic papers are posted this way. Google Scholar should become more useful due to this.

    • Tesseract (Score:5, Interesting)

      by mcrbids ( 148650 ) on Friday October 31, 2008 @11:21PM (#25592365) Journal

      Not so sure about PDFs as an image format - which is exactly what you have when you use PDF to hold scanned documents. I think the more interesting point is that they feel they have an OCR package good enough to be trustworthy. I wonder if it's based on the Tesseract OCR software [blogspot.com] that they adopted a while back?

      I played with it for a while, and got very poor results from the command line. Even when I made a png or bmp of a full screen single word "HELLO" in 200 pixel font with GIMP (about as perfect as input gets!) I'd often get "HEHO" or "H3H0" or god only knows what else.

      Of course, this was when the project relaunch was first announced a year or two ago; I certainly hope it's better now! Looking at their web page, it does appear that there's some significant activity [google.com] going on. Yay Google!

      Maybe I'll try it again, and see if it's worth using yet?

  • More to it (Score:2, Insightful)

    by spud.dups ( 1371655 )
    What I would really like to see is OCR for mathematical formulas, and store those in some standard format. Using a standard input, like LaTeX, the engine would search for mathematical equations. Right now I find it a pain to look for a formula that I know exists, but don't know its name.

    This would help bring together a lot of research that is done, but hard to sort through. Then, implement a smart system using a program like Mathematica to find variations of the equations, etc., and see where duplicat
    • That won't solve the problem you're having, since mathematical formulas contain arbitrary variable names.

      So if you're looking for the pythagorean formula z^2 = x^2 + y^2 say, (ie you've forgotten the name), then you'll miss the documents which contain c^2 = a^2 + b^2, etc.

      And that's the easy case, because a lot of people write z^2 = x^2 + y^2. What if for some reason your natural inclination is to type u^2 + f^2 = K^2? You'd be missing out on virtually every relevant link, because most mathematicians li

      • by PPH ( 736903 )

        Actually, if the problem of recognizing a formula in text or graphics has been solved, the second part, graphing the formula, normalizing the graph and storing/retrieving graphs that meet certain criteria is quite simple.

        In other words, getting from the graphics to z^2 = x^2 + y^2 is the tough part. Once you're there, understanding that z^2 = x^2 + y^2 is equivalent to a^2 + b^2 = c^2 is easy.

        • Once you're there, understanding that z^2 = x^2 + y^2 is equivalent to a^2 + b^2 = c^2 is easy.

          What I'm saying is that's the tough part, whereas the OCR is comparatively easy (eg InftyReader [inftyproject.org]).

          You can only transform an equation if you know its meaning (ie the rules of transformation embodied by the context in which it is being written). And understanding the meaning is a hard AI problem.

          • by PPH ( 736903 )

            And understanding meaning is a hard AI problem

            Not within a restricted knowledge domain. Mathematics, engineering, physics, etc. are some excellent examples of such domains.

            Been there, done that. Back when the Internet was still text based.

            • Mathematical formulas on their own are not a restricted domain. The reason is that the symbols and operations are overloaded to such an extent that you cannot, by looking only at a formula, know what it means. The surrounding context is completely necessary.

              For example, the "pythagorean" formula discussed above, z^2 = x^2 + y^2, doesn't carry any restrictions that tell you something. Are the variables numbers? In what kind of range? Are they matrices? Are they operators? Are the "2"s indices (labels) or e

              • by PPH ( 736903 )

                The reason is that the symbols and operations are overloaded to such an extent that you cannot, by looking only at a formula, know what it means.

                So, how do humans read and "understand" such a formula, sitting by itself, with no surrounding context? Answer: They don't. The same holds true for machines. The following equation: z^2 = x^2 + y^2 only makes sense if the terms and notation are defined for the context, most likely in the surrounding text. Likewise, typing in the search term: c^2 = a^2 + b^2 doesn't give either a human or a machine enough to go on. In either case, there are two approaches. One, prompt the user for further constraints. Or two

                • In either case, there are two approaches. One, prompt the user for further constraints. Or two, the 'Google' response, which is to list every possible solution.

                  Precisely, and both known approaches are imho useless for the purpose of the OP, which is to type in an equation and obtain relevant documents in the case that he doesn't remember the context or wants variations.

                  The 'google' type response for formulas has high recall and very low precision. In fact, it effectively exists already for code sea

      • You have a good point. If the program could determine which values are undefined, and what the defined portions of the problem are, then I think I have a solution. It would be similar to what happens to your program code as it's being compiled. The compiler doesn't care what the actual variable is, just if that variable is the same as another.

        For your solution, the database entry would be something like this:

        (arbitrary value 1)^2 = (arbitrary value 2)^2 + (arbitrary value 3)^2, (arbitrary value 1)!
        • (See also my other comments on this thread). You're touching on issues which are problems of mathematical logic [wikipedia.org].

          If you think of say z^2 = x^2 + y^2 as an expression that belongs to a formal grammar of arithmetic (computer languages have formal grammars too), then you could use the rules of the grammar to check when two formulas are equivalent, and you could store in your database a canonical form for each such formula. This sort of thing was actually proposed by the great mathematician Hilbert a hundred y
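The "arbitrary value" idea from this sub-thread can be sketched in a few lines. This is only a rough illustration under loud assumptions: every single letter is treated as a variable, formulas are plain sums joined by "=", and the function name `canonical_form` is made up for this example. It handles renaming, the order of terms in a sum, and the two sides of "=" being swapped; it does nothing about the hard part raised above, namely knowing what the symbols mean.

```python
import re

# Assumption for this sketch: every single ASCII letter is a variable name.
VAR = re.compile(r"[A-Za-z]")


def canonical_form(formula: str) -> str:
    """Return a variable-name-independent form of a simple sum equation."""
    sides = []
    for side in formula.replace(" ", "").split("="):
        terms = side.split("+")
        # Sort terms by their "shape" (variables blanked out), so the
        # order of terms within a sum does not matter.
        terms.sort(key=lambda t: VAR.sub("_", t))
        sides.append(terms)

    # Put the side with fewer terms first, so "A = B" and "B = A" agree.
    sides.sort(key=lambda ts: (len(ts), [VAR.sub("_", t) for t in ts]))
    text = " = ".join(" + ".join(ts) for ts in sides)

    # Rename variables v1, v2, ... in order of first appearance.
    names = {}

    def rename(match):
        return names.setdefault(match.group(0), f"v{len(names) + 1}")

    return VAR.sub(rename, text)


if __name__ == "__main__":
    for f in ("z^2 = x^2 + y^2", "a^2 + b^2 = c^2", "u^2 + f^2 = K^2"):
        print(f, "->", canonical_form(f))
```

All three spellings from the thread map to the same stored key, which is what a database lookup would need. Anything that requires algebra (moving a term across the "=", expanding a product) or context (are these numbers, matrices, operators?) is outside what this kind of textual normalization can do.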

  • Nice feature, but I think it only works with PDF? I would love to see the same with DjVu as well.
  • How about adding the word *scanned* to the headline, just as in the original headline?

    That way others won't have to read the summary going "Hey, I thought Google was searching PDFs for the last 10 years."

  • dark web.. oh geez. eternal September has only just started.
    apparently the world at large loves to shit on standards and practices.
    it's been a while since search engines actually returned results I was looking for. google, yahoo, msn, metacrawler,.. they all want my money. "-com" + adblock doesn't really help anymore. I'm so sick and tired of the net. it once was the best thing that ever happened to the world. now it's the hyper-communication tool for fart jokes and perversion.
    guess that tells you a lot a
    • I thought you were being cynical, but then I found http://www.fart-joke.com/ [fart-joke.com] . Ah, well, all good things must come to an end.

      How ironic that the uselessness of the web as a serious communications tool should be discussed on the web.

      • by eltaco ( 1311561 )
        just don't google bukkake or hentai.
        but, yeah, I get your point, there are still safe havens. granted. /. being one of the very few.
        you know what I like about genmay.net or somethingawful.com? they once spearheaded the development that the net is now witness to. I visited them regularly for my local and esoteric laugh.
        but then that shit hit mainstream - it was just the logical conclusion to the net. now "tits or gtfo" is common - same as 1337 once was a marker, now it's public and even grounds for bemus
  • It'll be even cooler when Google are able to automatically detect things like citations and references, and add hyperlinks as appropriate.

    It still sort of bugs me that scientific papers are written in LaTeX, and not hypertext, especially considering that the web (in its current form) originated at CERN.

  • There's a module in CPAN [cpan.org] for this. It rips out the images and runs them through Tesseract. [sourceforge.net] It's worked well the few times I've tried it. Certainly well enough for search engine indexing.

    Also, my understanding of the "dark web" concept was that it referred to sites that had no links going to them, so no spiders are able to access them. I'm not seeing how any of this would fix the "problem".

    The only news here is that Google doesn't already index form content in drop down boxes and selection menus. S
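The "rip out the images and run them through Tesseract" approach mentioned in this comment can be approximated with two command-line tools. This is a hedged sketch, not the CPAN module referred to above: it assumes Poppler's `pdfimages` and a reasonably recent `tesseract` build (one that reads PNG) are installed, and the helper names `extract_cmd`, `ocr_cmd`, and `ocr_pdf` are invented for illustration.

```python
import glob
import os
import shutil
import subprocess
import tempfile


def extract_cmd(pdf_path, prefix):
    """Command that dumps every embedded image as <prefix>-NNN.png."""
    return ["pdfimages", "-png", pdf_path, prefix]


def ocr_cmd(image_path):
    """Command that OCRs one image and writes the text to stdout."""
    return ["tesseract", image_path, "stdout"]


def ocr_pdf(pdf_path):
    """Extract the images from a scanned PDF and return the OCR'd text."""
    if not (shutil.which("pdfimages") and shutil.which("tesseract")):
        raise RuntimeError("pdfimages and tesseract must be installed")

    with tempfile.TemporaryDirectory() as workdir:
        prefix = os.path.join(workdir, "page")
        subprocess.run(extract_cmd(pdf_path, prefix), check=True)

        pages = []
        # pdfimages numbers its output files, so sorting keeps page order.
        for image in sorted(glob.glob(prefix + "-*.png")):
            result = subprocess.run(
                ocr_cmd(image), check=True, capture_output=True, text=True
            )
            pages.append(result.stdout)
    return "\n".join(pages)
```

Calling `ocr_pdf("scan.pdf")` (a hypothetical file name) would return whatever text Tesseract recognizes; how accurate that is depends entirely on scan quality and the Tesseract version.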

  • Very soon they will start evaluating javascript too, that will shed more light on the dark internet.

    Some kid's blog will have a new entry "How did I crash Google?"
