Google Now Searches JavaScript 114

Posted by timothy
from the watch-for-the-scriptview-vans dept.
mikejuk writes "Google has been improving the way the Googlebot searches dynamic web pages for some time, but the change is attracting extra attention just at the moment. In the past Google encouraged developers to avoid using JavaScript to deliver content, or links to content, because of the difficulty of indexing dynamic content. Over time, however, the Googlebot has incorporated ways of indexing content that is delivered via JavaScript. It has now become so good at the task that Google is asking us to allow the Googlebot to scan the JavaScript used by our sites. Working with JavaScript means the Googlebot has to actually download and run the scripts, and this is more complicated than you might think. This has led to speculation about whether it might be possible to include JavaScript on a site that uses the Google cloud to compute something. For example, imagine that you set up a JavaScript program to compute the first n digits of Pi, or a Bitcoin miner, and had the result formed into a custom URL, which the Googlebot would then try to access as part of its crawl. By looking at, say, the query part of that URL in your server log, you might be able to read back a useful result."
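The scheme the summary describes can be sketched in a few lines of client-side JavaScript. Everything here is illustrative: the /result endpoint and the Leibniz-series computation are assumptions for the sake of the example, not anything Google documents.

```javascript
// Some computation whose result we want read back from the crawl log,
// e.g. a partial sum of the Leibniz series, which converges to pi.
function partialPiSum(start, terms) {
  let sum = 0;
  for (let i = start; i < start + terms; i++) {
    sum += ((i % 2 === 0) ? 4 : -4) / (2 * i + 1);
  }
  return sum;
}

const result = partialPiSum(0, 100000);

// Emit a link whose query string encodes the result. A crawler that
// executes this script and follows generated links would request the
// URL, leaving the result in the site's access log.
const link = `/result?value=${encodeURIComponent(result)}`;
// In a browser this would become an <a href> element in the DOM.
console.log(link);
```

Reading the value back is then just a matter of grepping the access log for /result hits.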
This discussion has been archived. No new comments can be posted.

  • Really? (Score:5, Insightful)

    by Anonymous Coward on Saturday May 26, 2012 @04:25AM (#40119115)

    Googlebot will have a very quick timeout on scripts and probably won't be more powerful than a standard home computer. How would that be useful for calculating digits of pi or Bitcoin mining? It would take far longer than doing it the conventional way.

    • by SlovakWakko (1025878) on Saturday May 26, 2012 @04:31AM (#40119141)
      You can always cut the whole process into smaller steps, each providing a URL that will initiate the next step. Or you can provide several URLs and have the Google cloud compute a problem for you in parallel...
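The chaining idea above could look something like this toy sketch, where each simulated "crawl" resumes from the state the previous generated link encoded. The /step URL scheme, the chunk size, and the example computation (partial sums of 1/k^2, which converge to pi^2/6) are all made up for illustration.

```javascript
// Run one chunk of work, resuming from the given state.
function runStep(state) {
  let { k, sum } = state;
  const chunk = 1000;               // work done per crawl
  for (let i = 0; i < chunk; i++) {
    k += 1;
    sum += 1 / (k * k);
  }
  return { k, sum };
}

// Simulate three successive crawls. In the real scheme, each generated
// URL would be a link the bot follows, and the server would serve a page
// whose script resumes from the encoded state.
let state = { k: 0, sum: 0 };
for (let step = 0; step < 3; step++) {
  state = runStep(state);
  const nextUrl = `/step?k=${state.k}&sum=${state.sum}`;
  console.log(nextUrl);
}
```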
      • by Anonymous Coward on Saturday May 26, 2012 @04:35AM (#40119159)

        I already do this using a system of CNAMEs in a .xxx domain.

      • Re: (Score:1, Interesting)

        by Zero__Kelvin (151819)
        Even if this is possible, you would certainly be violating Google's guidelines and have your site blacklisted from the Googlebot pretty quickly. Furthermore, you could be charged with theft of services.
        • by ThatsMyNick (2004126) on Saturday May 26, 2012 @06:01AM (#40119417)

          Anyone wanting to do this would be doing it on a dedicated website. They won't care about the domain or IP address being blacklisted from Google. And good luck with the theft of services charge: they never asked Google to index them, and they did not even agree to any terms of service from Google. As I said, good luck.

          "Anyone wanting to do this would be doing it on a dedicated website. They won't care about the domain or IP address being blacklisted from Google."

            So you are saying that someone would go through all the trouble of registering the domain, creating the code, and getting (or waiting for) Google to index it, then wouldn't care that Google would cease to execute the actual code before the desired results are obtained? Re-read what I wrote. I merely said it would be blacklisted quickly. I didn't say that it woul

            So far blacklisting has worked pretty well for Google; it has been used effectively to punish black-hat SEO techniques.

              In this case though, if I don't care about my page rank, I would simply create tons of long domain names for pennies (plus ICANN fees). I would use a few at a time and wouldn't care if Google blacklisted a few at a time (I would be storing partial results, just as one of the parents mentioned, and the takeover should be seamless). It doesn't take a lot to recoup your domain name fees if your task is

              It doesn't take a lot to recoup your domain name fees if your task is purely computational.

                Dedicated hardware is cheap, and designing software costs a lot of money and time. What you are proposing would be ridiculously convoluted and costly, even disregarding the legal ramifications. We software engineers often talk about using the right tool for the right job. Your outlandish proposal ignores numerous sound engineering principles, not the least of which is adhering to this simple maxim.

                Maybe not. But if someone wanted to do it just for the heck of it, it can be done. It may not scale very well; otherwise I don't see any issues with it at all.

                  • As I was trying to explain, there is a huge difference between the problems you can see with it and the actual long list of problems that any moderately competent software engineer could quickly point out.
                    I think you missed the "just for the heck of it". I understand my approach is not the practical one, and any sane person would just do what little can be done on their own hardware. But that doesn't mean it cannot be done in a no-loss way. Say I want to calculate the last 100 digits of Graham's number: it can be split into multiple calculations, and a sub-result calculation can take less than a second (which is what I assume Google will limit the runtime to). The bandwid

                    • Let's start with the simplest problem. You plan on having Googlebot load and run your client side code. Great. Now how do you plan to get Googlebot to feed you the result?
                    • by ThatsMyNick (2004126) on Sunday May 27, 2012 @06:17AM (#40127275)

                      Your JS would generate HTML on the client side. Just generate a link that your server can understand. Googlebot, doing what it does, will try to load this URL. When it does, the server stores this result and generates a new problem for Googlebot to solve. This is the basis for the article and the entire comment thread.

                    • "Your JS would generate HTML on the client side."

                      Like I said, you are making assumptions about Googlebot. You seem to think that they have no idea how to sanitize an input and will just execute whatever you send them byte for byte. That's not going to happen.

                    • Er, they are looking for JS that generates HTML (so this is not an assumption). The purpose of Googlebot is to index. If they ran the JS and didn't even index the results, it would make no sense.

                      Would you mind specifically mentioning what assumption I am making? And there is no way to sanitize JS: it is a Turing-complete language, and there is no way (at least as far as present-day research goes) to sanitize it in any reasonable fashion.

                    • Sorry about the typos, I guess I need to get some sleep.

    • Depends how often they hit your site. Google has been known to check sites pretty regularly.
    • by Sloppy (14984)

      Wait a minute, are you suggesting that having spiders run my JavaScript x86 emulator, which runs JRuby scripts that mine bitcoins, isn't practical?

  • by Anonymous Coward

    I fail to see why having other parties fetch your arbitrary code and execute it is such a wonderful idea.

  • by maxwell demon (590494) on Saturday May 26, 2012 @05:07AM (#40119259) Journal

    Send Google JavaScript which generates different results for Google than for normal visitors, in order to rank up the site.

    • by aaronb1138 (2035478) on Saturday May 26, 2012 @05:18AM (#40119287)

      What is this method you have written, "sudo_mod_me_up?"

    • by Anonymous Coward

      You don't need JavaScript for that. A lot of servers serve different HTML to Google than to us. It's especially noticeable when searching for a rare term; Google will show you results that appear to contain the term, but without relevant context (only mystifying unrelated terms) and when you open it the page turns out to have some completely different subject.

    • by Anonymous Coward

      I noticed this in a PHP attack script earlier this year. It installs a script pointing to a Russian malware domain, but only inserts it in the page if the user agent is not GoogleBot or a few other spiders. It also checked for some Google ip ranges. Surely Google must be combating this by doing some stealth spidering, otherwise SEO and malware providers will game them if they stick to their classic robot rules.

    • This is already being done, but in reverse. Google doesn't like it much either. Get caught, and you are de-listed.
      • The point is, with Google executing JavaScript you could make it less obvious, by having the JavaScript depend on some difference between Google's and the browser's JavaScript execution (maybe timings of certain rendering operations).

        Also, it might be used through XSS, to have competitors delisted.
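The kind of fingerprinting described above might look like the toy check below. Every heuristic and threshold here is invented for illustration; nothing is taken from Googlebot's actual behavior.

```javascript
// Decide whether to serve "real" content based on differences between a
// normal browser environment and a crawler's JS engine.
function looksLikeCrawler(env) {
  if (!env.hasMouseEvents) return true;            // bots rarely move a mouse
  if (env.renderTimeMs < 1) return true;           // suspiciously fast "rendering"
  if (/bot|spider|crawl/i.test(env.userAgent)) return true;
  return false;
}

const browserLike = { hasMouseEvents: true, renderTimeMs: 16, userAgent: "Mozilla/5.0" };
const botLike = { hasMouseEvents: false, renderTimeMs: 0, userAgent: "Googlebot/2.1" };
console.log(looksLikeCrawler(browserLike), looksLikeCrawler(botLike));
```

This is exactly the class of trick the thread expects Google to punish with de-listing when caught.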

    • Or one that generates useful looking links to other sites you own (on different servers and subnets, of course).
    • I would be surprised if the Googlebot didn't try everything to appear to the server like a normal user's browser. Even better would be to crawl a site while in disguise, then again while not disguised. Differences would affect the site's ranking negatively.

      • Serving different content based on IP or self-identification is possible even without JavaScript. However, if the detection makes use of peculiar behavior of the JavaScript implementation (and the JavaScript implementation will have to have some differences, or else it won't find content which is initially hidden but unhidden by a user interaction), just fetching from a different IP or with a different browser/spider identification doesn't work.

        And BTW, the spider will certainly expose itself from the very

  • When I was looking at the page previews (in Google) of my JavaScript network scanner, I noticed it listed some IPs, indicating that it was running the script. Just google "http://bwns.be/jim/scanning_printing/detect_range.html" and look at the preview. (Also, most of those IPs probably exist, as my script indicates it is sure about them.)
    • You typoed your URL: you have detect_range.html, which is actually detect-range.html.
    • Now that you mention it: the preview Google shows of one of my sites has all the CSS applied, including some that is applied by JavaScript after the page load.

  • so much for (Score:5, Insightful)

    by Anonymous Coward on Saturday May 26, 2012 @06:24AM (#40119477)

    using javascript to hide or obfuscate email addresses to help protect them from spammers, scammers and bots.

    thanks fer nuttin, google.
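For reference, the obfuscation being discussed is usually something like the sketch below (address hypothetical): the address never appears verbatim in the HTML, but a crawler that executes the script sees the assembled mailto link anyway.

```javascript
// Pieces kept apart so a plain-text scraper finds no address in the
// page source; only executing the script produces it.
function assembleEmail(user, domain) {
  return user + String.fromCharCode(64) + domain; // 64 is '@'
}

const addr = assembleEmail("webmaster", "example.com");
const mailto = `<a href="mailto:${addr}">contact</a>`;
console.log(mailto);
```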

    • robots.txt
    • Uhm, years ago one could already do that using SpiderMonkey and some Perl. It's what I used to report nasty redirects in Blogspot/Blogger to Google (thousands and thousands). It took me some time, but Google did see the light and the problem was resolved.

      Why do people keep thinking that spammers are retards? If it can be abused, it will be. And spammers/cybercriminals are among the first to do so.

    • by goaxcap (2648385)
      Use images or Flash to display the email address.
  • by Anonymous Coward

    Now that Google controls the client, the search engine and the analytics, it should not be too difficult for them to see how traffic flows between sites. Pages need not even be physically linked for Google to see a connection, e.g. reading an article on the BBC may cause people to search for a company. With people signing into Chrome, Google must have some very rich logs.

  • by Anonymous Coward

    Although maybe not quite in the same context. Google used to display javascript-munged email addresses in their search results until some of the larger sites involved, such as Rootsweb, complained.

  • by Anonymous Coward

    I really hope website developers and web application developers know the difference between GET and POST requests.

    Else, this could turn ugly.
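The GET/POST point can be illustrated with a toy request handler (routes hypothetical): crawlers follow plain links, which are GET requests, but do not submit forms, so any state-changing action reachable via GET can be fired by an indexing bot.

```javascript
const articles = [];

function handleRequest(method, path) {
  // Dangerous design: mutation on GET. A bot crawling links like
  // "/article/new?title=..." would happily trigger it.
  if (method === "GET" && path.startsWith("/article/new")) {
    articles.push("content created by whoever followed a link");
    return 200;
  }
  // Safer: require POST for anything that changes state; crawlers
  // following links never issue POST.
  if (method === "POST" && path === "/article") {
    articles.push("deliberate form submission");
    return 201;
  }
  return 404;
}

handleRequest("GET", "/article/new?title=hello"); // any crawler can do this
console.log(articles.length); // the "database" already changed
```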

    • by physburn (1095481)
      I've often programmed "write new article" or "add item" actions as GET links, and also as JavaScript actions. That would mean Google is going to be spamming forums and databases. What's the robots.txt command to prevent Google running the JavaScript on a page?
      • by xOneca (1271886)
        Maybe put the JavaScript functions in a separate file and use robots.txt to ban bots' access to it.
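For example, assuming the scripts live under a /js/ path (the path is hypothetical), a robots.txt along these lines would ask compliant crawlers not to fetch them; whether a crawler honors it for script resources is up to the crawler:

```text
User-agent: *
Disallow: /js/
```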
  • I can already picture hackers drooling at the idea of turning Google's cloud into the ultimate zombie network.

  • by The MAZZTer (911996) <<megazzt> <at> <gmail.com>> on Saturday May 26, 2012 @09:46AM (#40120435) Homepage
    If you check out some of the thumbnails, it looks like Googlebot is using a customized version of Chrome now. You can see it blocking plugins.
  • It's inevitable. Someone will figure out a way to abuse the system that google hasn't thought to make contingencies for yet. I'm on the fence as to whether this is a good idea. I just hope they know what they're doing.
    • by dave420 (699308)
      Yeah, it's true - Google clearly knows nothing about searching the internet. ;)
      • Dave, every time they make a change like this, they get hammered. They made some big changes the release before "panda" and the site was useless for almost a year.
  • by Hentes (2461350) on Saturday May 26, 2012 @01:59PM (#40122065)

    You don't need to actually run the scripts; most of the time it's enough to just scrape the strings and links out of them.
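The static approach this comment describes can be as crude as pulling string literals out of the script source with a regular expression. This is a sketch of the idea, not how any real crawler works:

```javascript
// Extract quoted string literals, then keep the ones that look like
// URLs or paths. No JS execution involved.
function scrapeScript(source) {
  const strings = [...source.matchAll(/["']([^"']+)["']/g)].map(m => m[1]);
  const links = strings.filter(s => /^https?:\/\/|^\/|\.html?$/.test(s));
  return { strings, links };
}

const js = `var next = "/page2.html"; var msg = 'hello'; load("http://example.com/a");`;
console.log(scrapeScript(js));
```

It obviously misses anything built up at runtime, which is exactly the gap that executing the scripts closes.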

  • Oh yeah, fuck accessibility. Fuck the web in general. "It's better for everybody". That's literally all you need to know. "Just go ahead and remove that from your robots.txt".

    I'm not saying there may not be good reasons (e.g. having the CSS and JavaScript actually makes it possible to detect invisible text and whatnot; without that, search engines may not even have a chance), but I really would appreciate some good reasoning, not being talked to like a fucking 5-year-old.

    Or hey, how about adding that "of cou

  • Spammers! (Score:4, Informative)

    by xenobyte (446878) on Sunday May 27, 2012 @03:06AM (#40126669)

    They've been testing this for a while. We've already had the first complaints against someone spamming an email address that exists in exactly one place: online, as the result of some (trivial) JavaScript. It turned out that if you Googled the page, the result snapshot included the JavaScript-generated email... In other words, it's already there, and this will effectively kill JavaScript as a way of hiding functioning mailto links. Okay, it would be fairly simple to add a condition based on the User-Agent, as Googlebot is easily identified, but it will make things a bit more complicated for the average user.
