Google Sheds Light On 'Dark Web' With PDF Search
CWmike writes "Google this week took another step in its effort to shed light on the so-called Dark Web, announcing that its search engine can now search scanned documents in a PDF. In April, Google announced that it was looking for ways for its search engine to index content hidden behind HTML form elements, such as drop-down boxes and select menus, that otherwise couldn't be found or indexed."
An announcement is available at the official Google blog, and it contains some demonstration searches.
Re: (Score:1)
Re:1000 years of darkness coming to an end? (Score:5, Funny)
After reading that, I've come to the conclusion that some parts of the internet should definitely remain in the dark.
Re: (Score:2, Informative)
Just look at
Interesting that their example didn't work! (Score:2)
But in their Repairing Aluminum Wiring [google.com] example, the PDF reads:
and the Google HTML reads:
Re:Just what we needed (Score:4, Informative)
Re: (Score:2)
"For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes and radio buttons on the form, we choose from among the values of the HTML," they noted in a blog post. "Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the Web page resulting from our query is valid, interesting and includes content not in our index, we may inc
Re: (Score:2)
Re:Just what we needed (Score:5, Informative)
If people want their sites to be indexed, they shouldn't use forms for navigation. It's not rocket science.
This isn't about people who want their sites indexed. It's about sites that Google wants to index, but which aren't designed to be indexed. If you prefer not to be indexed, Google says they will abide by robots.txt.
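For the curious, "abide by robots.txt" just means the crawler checks the standard robots exclusion file before fetching a form-generated URL. A minimal sketch using Python's stdlib `urllib.robotparser` (the robots.txt contents and URLs here are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block form-generated /search URLs,
# but leave /products open to crawlers.
robots_txt = """\
User-agent: *
Disallow: /search
Allow: /products
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler consults can_fetch() before each request.
print(parser.can_fetch("Googlebot", "http://example.com/search?q=widgets"))   # False
print(parser.can_fetch("Googlebot", "http://example.com/products?id=xj123"))  # True
```

So a site that doesn't want its forms probed only has to disallow the relevant paths.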
Re:Just what we needed (Score:5, Insightful)
I think it is just going to look in the contents of the controls. This would be really useful, for instance if you search for "Widget Model XJ123" it will now find a page by a manufacturer where the only place they list it is in a pulldown list that lets you choose the product to buy.
Re:Just what we needed (Score:4, Funny)
for instance if you search for "Widget Model XJ123" it will now find a page by a manufacturer where the only place they list it is in a pulldown list that lets you choose the product to buy.
Shenanigans! [google.com] And I've been looking everywhere for that elusive XJ123, since the manufacturer stopped producing it. How dare you get my hopes up!
Re:Just what we needed (Score:4, Funny)
for instance if you search for "Widget Model XJ123" it will now find a page by a manufacturer where the only place they list it is in a pulldown list that lets you choose the product to buy.
Shenanigans! [google.com] And I've been looking everywhere for that elusive XJ123, since the manufacturer stopped producing it. How dare you get my hopes up you insensitive clod!
There. Fixed that for you.
Re: (Score:2)
Re: (Score:1)
When in doubt, remove possibly extraneous search terms. I had to dig, but I found an xj123 model [iloveswimwear.com]...
Re: (Score:2)
It might actually be useful for a search engine to read the product name in a pulldown as a simple indicator that the page should be penalized as content free. I would probably pay to use that kind of search engine.
Re: (Score:1)
What if I don't want to buy a widget? I'd like to see a Google filter which hides all the product pages from its listing.
Which, incidentally, would probably also boost Google's ad business, since they would no longer be providing free advertising. Sales in the sponsored links, info in the search results; sounds good to me.
Re: (Score:3, Interesting)
Well, if it's a form with a GET request then it should be safe to request it, and it's used merely to display some information. Forms using the POST method, which performs an action, are less safe and I'd hope Google is not trying to spider those.
So the alternative is automatically generating pages and pages of links to every possible item in the database just so that search engines can follow them? If a form is the most nat
Re: (Score:2)
That's under the completely unsafe assumption that forms are being used properly. There have been numerous instances of people putting full SQL queries (with DB connection data) in a GET form - see TheDailyWTF.
Though I suppose that's a bad example, as it would be really damn easy for Google to index THOSE sites. Just swap in a SELECT * and you're all set :)
Re: (Score:2)
It looks like they will only use GET requests, not POST requests. You may have trouble if you use GET requests to make changes on your site, which anyone with even minimal experience knows you should never do.
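Mechanically, "GET forms only" means the crawler can turn a form's fields into a query string and fetch the resulting URL, while leaving POST forms (which may have side effects) untouched. A rough sketch with Python's stdlib; the form structure and values are invented for illustration:

```python
from urllib.parse import urlencode, urljoin

def form_to_url(base_url, form):
    """Turn a GET form into a crawlable URL; refuse POST forms,
    since POST is allowed to perform actions."""
    if form["method"].upper() != "GET":
        return None  # don't spider forms that may change state
    query = urlencode(form["values"])
    return urljoin(base_url, form["action"]) + "?" + query

# Hypothetical select menu: the crawler picks one of the listed values.
form = {
    "method": "get",
    "action": "/products",
    "values": {"model": "XJ123"},
}
print(form_to_url("http://example.com/catalog", form))
# http://example.com/products?model=XJ123
```

A POST form passed to the same function simply returns `None`, which matches the "we only try GET" claim.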
Cool, and definitely worthwhile, but... (Score:2)
Obviously, you can't have breakthroughs every year (or maybe even every 5 years) but search as an algorithm still has much more room to improve. I'd love to see an improvement in that, as opposed to just increasing the number of pages indexed.
Still cool though...
Re: (Score:2)
I'm still waiting for a context modifier for keywords, so when you type something like 'mechanics:teeth' you get all the technical matches for gears, and when you type 'medicine:teeth' you would get all the medical matches for dentistry.
Re:Cool, and definitely worthwhile, but... (Score:5, Insightful)
Why not just search for "teeth medicine" then? Google hasn't done only direct keyword matching in years (for example, a search for "computer" may yield results containing synonyms such as "PC" or "Mac" even if the keyword "computer" doesn't appear on the site at all).
Remember that Yahoo started out as a category browser in its very early days, and now categories are really just another keyword. Google and all of the other search engines are designed to work well for the lowest common denominator of internet users - as someone with a 3-digit UID, I imagine you're not in that group. Trying to outsmart Google will probably just make its algorithm feel unnatural/broken.
Re: (Score:2)
Re: (Score:2)
Just use http://www.clusty.com/ [clusty.com] . The search results are just as good as google, and it generates a list of categories that you can select from.
Admittedly, "mechanical" isn't in there... The categories are quite a bit more specific, such as "baby", "shark", "wisdom", "cleaner", etc.
Re: (Score:2)
Re: (Score:2)
When are they going to add Gmail contents to their search results?
Dark web? Deep Web! (Score:2, Insightful)
Referenced article is talking about the "deep web", not dark web.
Re: (Score:2)
Never heard of either before. Looks like there's a competition going on to see who comes up with the next buzzword.
Re: (Score:2)
Re: (Score:2)
Deep web is information buried under layers that are not easily penetrable by current indexing tech.
Dark web can either be physically separate from the internet or a virtual network that is hidden through encryption, secrecy, or both.
Re: (Score:1)
Re:'Scanning is the reverse of printing.' (Score:5, Informative)
"Scanning is the reverse of printing." -- WTF?! Because of artifacts?
And isn't this what View as HTML has ALWAYS been about?
Points awarded for techtard clarity, but the person at Google who thought a press release aimed at techtards was a good idea should be firmly smacked.
Calm down please. The guy is trying to explain the concept to a broader audience, or 'techtards' as you so pompously refer to them along with your out-of-context quote, and he's doing a fine job of explaining how it is hard for a computer to interpret scanned text. The days are gone when the web was the preserve of nerds with zero social skills. Get over it.
Re:'Scanning is the reverse of printing.' (Score:5, Informative)
Scanning is the reverse of printing because, well, it's the reverse of printing. When you're scanning something, you're taking a purely human-readable document and translating its contents into a machine interpretable form. This is pretty much the exact opposite of printing from a computer.
Re: (Score:1)
Re: (Score:1)
images of text, not images of things. To obtain text from a photograph of a person, or a painting, is beyond even Google at the moment...
BTW: I wish Adobe used this OCR, so search would work on a PDF of scanned text.
small nit-pick (Score:2)
There are "dark webs", but this isn't them. (Score:5, Insightful)
But as you say, this is something completely different.
Re: (Score:2)
Indeed, it is called the deep web [wikipedia.org].
Even the first link uses that term. The submitter messed up (and the editors didn't catch it. News at 11)
Yes, that's true. (Score:2)
BS, TFA says Yahoo is a search engine! (Score:1)
Re: (Score:2)
Image search needs help I guess (Score:1)
Not so new? (Score:5, Interesting)
Google has long favoured PDFs and gives them boosted results, under the guise that anyone who makes a PDF has something serious to say, I guess.
You may have noticed of late that people are wise to this: there are a bunch of sites embedding popular search terms and results in PDF files, and cluttering their sites with adverts.
Re: (Score:2)
I've noticed this. Lots of the top 5 search results for the items I usually search for are PDFs. I just figured that publishing a PDF is something large organizations usually do, and large organizations tend to have a higher PageRank. I'm not sure if PDF is in itself something that can raise a score.
Re: (Score:1)
The new part is that Google can now index PDFs that have no text, only embedded images, via OCR. These are pretty common as a way of posting scanned multi-page documents online; for example, many older academic papers are posted this way. Google Scholar should become more useful because of this.
Tesseract (Score:5, Interesting)
Not so sure about PDFs as an image format - which is exactly what you have when you use PDF to hold scanned documents. I think the more interesting point is that they feel they have an OCR package good enough to be trustworthy. I wonder if it's based on the Tesseract OCR software [blogspot.com] that they adopted a while back?
I played with it for a while, and got very poor results from the command line. Even when I made a png or bmp of a full screen single word "HELLO" in 200 pixel font with GIMP (about as perfect as input gets!) I'd often get "HEHO" or "H3H0" or god only knows what else.
Of course, this is when the project relaunch was first announced a year or two ago, I certainly hope it's better now! Looking at their web page, it does appear that there's some significant activity [google.com] going on. Yay Google!
Maybe I'll try it again, and see if it's worth using yet?
More to it (Score:2, Insightful)
This would help bring together a lot of research that is done, but hard to sort through. Then, implement a smart system using a program like Mathematica to find variations of the equations, etc., and see where duplicat
Re: (Score:2)
So if you're looking for the Pythagorean formula z^2 = x^2 + y^2, say (i.e. you've forgotten the name), then you'll miss the documents which contain c^2 = a^2 + b^2, etc.
And that's the easy case, because a lot of people write z^2 = x^2 + y^2. What if for some reason your natural inclination is to type u^2 + f^2 = K^2? You'd be missing out on virtually every relevant link, because most mathematicians li
Re: (Score:2)
Actually, if the problem of recognizing a formula in text or graphics has been solved, the second part, graphing the formula, normalizing the graph and storing/retrieving graphs that meet certain criteria is quite simple.
In other words, getting from the graphics to z^2 = x^2 + y^2 is the tough part. Once you're there, understanding that z^2 = x^2 + y^2 is equivalent to a^2 + b^2 = c^2 is easy.
Re: (Score:2)
What I'm saying is that's the tough part, whereas the OCR is comparatively easy (eg InftyReader [inftyproject.org]).
You can only transform an equation if you know its meaning (ie the rules of transformation embodied by the context in which it is being written). And understanding the meaning is a hard AI problem.
Re: (Score:2)
And understanding meaning is a hard AI problem
Not within a restricted knowledge domain. Mathematics, engineering, physics, etc. are some excellent examples of such domains.
Been there, done that. Back when the Internet was still text based.
Re: (Score:2)
For example, the "pythagorean" formula discussed above, z^2 = x^2 + y^2, doesn't carry any restrictions that tell you something. Are the variables numbers? In what kind of range? Are they matrices? Are they operators? Are the "2"s indices (labels) or e
Re: (Score:2)
The reason is that the symbols and operations are overloaded to such an extent that you cannot, by looking only at a formula, know what it means.
So, how do humans read and "understand" such a formula, sitting by itself, with no surrounding context? Answer: They don't. The same holds true for machines. The following equation: z^2 = x^2 + y^2 only makes sense if the terms and notation are defined for the context, most likely in the surrounding text. Likewise, typing in the search term: c^2 = a^2 + b^2 doesn't give either a human or a machine enough to go on. In either case, there are two approaches. One, prompt the user for further constraints. Or two
Re: (Score:2)
Precisely, and both known approaches are imho useless for the purpose of the OP, which is to type in an equation and obtain relevant documents in the case that he doesn't remember the context or wants variations.
The 'google' type response for formulas has high recall and very low precision. In fact, it effectively exists already for code sea
z^2 = x^2 + y^2 (Score:1)
For your solution, the database entry would be something like this:
(arbitrary value 1)^2 = (arbitrary value 2)^2 + (arbitrary value 3)^2, (arbitrary value 1)!
Re: (Score:2)
If you think of say z^2 = x^2 + y^2 as an expression that belongs to a formal grammar of arithmetic (computer languages have formal grammars too), then you could use the rules of the grammar to check when two formulas are equivalent, and you could store in your database a canonical form for each such formula. This sort of thing was actually proposed by the great mathematician Hilbert a hundred y
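One cheap approximation of such a canonical form is alpha-renaming: rewrite every variable to a numbered placeholder in order of first appearance, so z^2 = x^2 + y^2 and c^2 = a^2 + b^2 normalize to the same string. This is only a sketch; it ignores commutativity and any real grammar rules:

```python
import re

def canonicalize(formula):
    """Rename single-letter variables to v0, v1, ... in order of first
    appearance, so structurally identical formulas compare equal.
    (Ignores commutativity: 'x^2 + y^2' and 'y^2 + x^2' still differ.)"""
    mapping = {}
    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]
    return re.sub(r"[a-zA-Z]", rename, formula.replace(" ", ""))

print(canonicalize("z^2 = x^2 + y^2"))  # v0^2=v1^2+v2^2
print(canonicalize("c^2 = a^2 + b^2"))  # v0^2=v1^2+v2^2
```

A real system would canonicalize over the formal grammar (sorting commutative operands, normalizing notation) before storing the key in the database, but the lookup idea is the same.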
Re: (Score:1)
Please add DjVu (Score:2)
*SCANNED* PDFs (Score:2)
How about adding the word *scanned* to the headline, just as the original headline had it.
That way others won't have to read the summary going, "Hey, I thought Google had been searching PDFs for the last 10 years."
*sigh* eternal september (Score:1)
apparently the world at large loves to shit on standards and practices.
it's been a while since search engines actually returned results I was looking for. google, yahoo, msn, metacrawler,.. they all want my money. "-com" + adblock doesn't really help anymore. I'm so sick and tired of the net. it once was the best thing that ever happened to the world. now it's the hyper-communication tool for fart jokes and perversion.
guess that tells you a lot a
Re: (Score:2)
I thought you were being cynical, but then I found http://www.fart-joke.com/ [fart-joke.com] . Ah, well, all good things must come to an end.
How ironic that the uselessness of the web as a serious communications tool should be discussed on the web.
Re: (Score:1)
but, yeah, I get your point, there are still safe havens. granted.
you know what I like about genmay.net or somethingawful.com? they once spearheaded the development that the net is now witness to. I visited them regularly for my local and esoteric laugh.
but then that shit hit mainstream - it was just the logical conclusion to the net. now "tits or gtfo" is common - same as 1337 once was a marker, now it's public and even grounds for bemus
Re: (Score:1)
just don't google bukkake or hentai.
Well, not until you get home, anyhow...
What'll be really cool is.... (Score:2)
It'll be even cooler when Google are able to automatically detect things like citations and references, and add hyperlinks as appropriate.
It still sort of bugs me that scientific papers are written in LaTeX, and not hypertext, especially considering that the web (in its current form) originated at CERN.
So? (Score:2)
There's a module in CPAN [cpan.org] for this. It rips out the images and runs them through Tesseract. [sourceforge.net] It's worked well the few times I've tried it. Certainly well enough for search engine indexing.
Also, my understanding of the "dark web" concept was that it referred to sites that have no links going to them, so no spiders are able to access them. I'm not seeing how any of this would fix the "problem".
The only news here is that Google doesn't already index form content in drop down boxes and selection menus. S
javascript? (Score:1)
Some kid's blog will have a new entry "How did I crash Google?"