Google Indexing In Near-Realtime

krou writes "ReadWriteWeb is covering Google's embrace of a system that would enable any Web publisher to 'automatically submit new content to Google for indexing within seconds of that content being published.' Google's Brett Slatkin is the lead developer of PuSH, or PubSubHubbub, a real-time syndication protocol based on Atom, in which 'a publisher tells the world about a Hub that it will notify every time new content is published.' Subscribers then wait for the hub to notify them of the new content. Says RWW: 'If Google can implement an Indexing by PuSH program, it would ask every website to implement the technology and declare which Hub they push to at the top of each document, just like they declare where the RSS feeds they publish can be found. Then Google would subscribe to those PuSH feeds to discover new content when it's published. PuSH wouldn't likely replace crawling; in fact, a crawl would be needed to discover PuSH feeds to subscribe to, but the real-time format would be used to augment Google's existing index.' PuSH is an open protocol, and Slatkin says, 'I am being told by my engineering bosses to openly promote this open approach even to our competitors.'"
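
For readers who don't want to dig through the spec, here is a rough, hypothetical sketch (Python, standard library only) of the discovery step the summary describes: a crawler fetches a feed it already knows about and reads the rel="hub" and rel="self" links to learn which hub to subscribe to. The feed URL is a placeholder, not a real endpoint.

    # Sketch of PuSH hub discovery from an Atom feed; not Google's actual code.
    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"

    def discover_hub(feed_url):
        """Return the (hub URL, topic URL) pair declared by an Atom feed."""
        with urllib.request.urlopen(feed_url) as resp:
            root = ET.fromstring(resp.read())
        hub = topic = None
        for link in root.findall(ATOM + "link"):
            if link.get("rel") == "hub":
                hub = link.get("href")      # where to send the subscription request
            elif link.get("rel") == "self":
                topic = link.get("href")    # the canonical feed ("topic") URL
        return hub, topic

    print(discover_hub("http://example.com/feed.atom"))  # placeholder feed URL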

  • by Pojut ( 1027544 ) on Thursday March 04, 2010 @01:02PM (#31359520) Homepage

    ...someone help me out here. People can still find my articles through google before I see the googlebot hit any new articles I post...how is that possible? How would my pages show up on google before the bot actually crawls them?

    • by NovTest ( 909599 )
      Test
    • by garcia ( 6573 ) on Thursday March 04, 2010 @01:17PM (#31359758)

      My site is by no means a high-traffic site, but Googlebot indexes my pages (and shows them in search results) within three minutes:

      crawl-66-249-65-232.googlebot.com - - [04/Mar/2010:10:33:34 -0600] "GET /current-crime-decline-to-cause-public-safety-cuts HTTP/1.1" 200 47330 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

      I really don't see a need for something to be any more "real time" than that for someone's blog. Do you?

      • by Entrope ( 68843 )

        Absolutely. With this breakthrough technology, a cutting-edge new media purveyor can ensure that their reportage, opinions, and commentary are easily accessible to the general public with minimal delay. In today's fast-paced Internet, a few minutes' delay can make the difference between being on the breaking edge of news and being a Johnny-come-lately.

        (To be more succinct, PuSH lets bloggers make sure they have the first post.)

      • by mmkkbb ( 816035 )

        Some blog engines will automatically notify search engines of an updated site map upon publishing new content.
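
        As a rough illustration (not any particular engine's code), such a ping is just an HTTP request to the search engine's ping endpoint with the sitemap URL as a parameter; the endpoint and sitemap URL below are assumptions for the example.

            # Hypothetical sketch of a post-publish sitemap ping; URLs are examples only.
            import urllib.parse
            import urllib.request

            def ping_sitemap(sitemap_url):
                ping = "http://www.google.com/ping?sitemap=" + urllib.parse.quote(sitemap_url, safe="")
                with urllib.request.urlopen(ping) as resp:
                    return resp.status  # 200 means the ping was received

            ping_sitemap("http://example.com/sitemap.xml")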

      • by Pojut ( 1027544 )

        I really don't see a need for something to be any more "real time" than that for someone's blog. Do you?

        Not really...I generally update my site 4-6 times per week, but when I update it I'm only posting one article a day, with the odd site announcement every so often. Maybe I just suck, I don't know, but it seems like it takes a week or two before people start really reading what I write; they always seem to read what I wrote a week or so ago instead of the new content. This happens even if they land on my main page (linked in my sig) rather than on an actual article. ::shrug:: whatever. I average be

        • Re: (Score:3, Interesting)

          by garcia ( 6573 )

          maybe I just suck, I don't know, but it seems like it takes a week or two before people start really reading what I write, they always seem to read what I wrote a week or so ago instead of the new content.

          As you write more often (say, on a specific schedule, and daily), the people who don't read via RSS (which in my case is the majority of my readers) will learn to make going to your site a part of their daily routine, and thus your visits to new material will go up.

          I watched visiting trends, by hour, over the last two years in Google Analytics and picked 7:30 AM and 10:30 AM as the times to post material. It seemed as if most people were checking once in the morning when they got to the office and once at breaktime/lunchtime around 11 AM. To account for some of the time variance seen across those two years I went with 15 minutes earlier than the stats showed. Seems to work for me.

          • by Pojut ( 1027544 )

            Cool, thank you! I'll definitely have a look at that.

          • by vux984 ( 928602 )

            I watched visiting trends, by hour, over the last two years in Google Analytics and picked 7:30 AM and 10:30 AM as the times to post material. It seemed as if most people were checking once in the morning when they got to the office and once at breaktime/lunchtime around 11 AM. To account for some of the time variance seen across those two years I went with 15 minutes earlier than the stats showed. Seems to work for me.

            Odd that everyone who reads your content is in your timezone. Do you primarily post artic

            • by garcia ( 6573 )

              95% of my content isn't just local, it's hyperlocal. Thanks for asking about this; I did limit the analysis to those visitors I put into an "Advanced Segment" where the visitors' region was Minnesota.

      • I've noticed that when I post a new blog entry on Livejournal, it appears in Google's results within 2-3 minutes. I know that Livejournal has a public feed for all new blog entries across the site, so I assume Google must be indexing this (and presumably others).

      • by Jurily ( 900488 )

        I really don't see a need for something to be any more "real time" than that for someone's blog. Do you?

        In rare cases like the swine flu panic, 3 minutes can be the difference between fame and obscurity.

    • Google can see the future. Didn't I tell you about that tomorrow?
    • If Google made a new-site queue, you could request that your URL be put on that queue. Then the Googlebot would not have to find your content from links on old URLs.

      The result: your content gets scanned in seconds, not hours or days.
    • It's like an RSS feed for Google. Just like you'd use an RSS feed to keep up with various blogs instead of visiting them constantly.

    • by zonky ( 1153039 )
      RSS?
  • kinda done now (Score:5, Informative)

    by hey ( 83763 ) on Thursday March 04, 2010 @01:05PM (#31359550) Journal

    If Google notices your site/blog updates frequently, the bot will come around more often, especially if it's a high-PageRank site.

    • That is still slower, not to mention far less efficient for both parties, than event-driven updates.

      • by shird ( 566377 )

        1. Go to 4chan/b and post a unique sentence.
        2. Observe how quickly stuff gets posted to that site.
        3. Search for that sentence through Google.
        4. Be amazed that Google actually indexes this site.

    • There is no such thing as a high-PageRank site. The name PageRank is a play on words: for one, it is the inventor's last name (Larry Page); for another, it is computed on a per-page basis.

  • How is this any different from sitemaps [wikipedia.org]? Sitemaps are supported by major search engines and have been in use for years now.
    • That involves the Googlebot hitting the sitemap, or you submitting it manually...

      this is all automatic.

      However, how is this any different from RSS (except that it's designed to be read by a machine rather than a human)?

      • The only way a standard RSS reader can find out if a feed has updated is by "polling" the feed periodically. PuSH and similar systems remove the need for this polling by pinging the client directly when something changes.
      • However, how is this any different from RSS (except that it's designed to be read by a machine rather than a human)?

        RSS is a pull technology. I update my blog, which updates my RSS feed, and the Googlebot goes out, pulls my sitemap (which is my RSS feed on Blogger), and indexes any new pages. This technology sounds like it would let me ping Google when my site is updated so they know there is new data for them to pull.

        • No, this is not a ping technology. The hub actually sends the new data to the recipient. So basically you publish a feed. The hub subscribes to that feed. When you post new content, you ping the hub. The hub then fetches the new data. It then turns around and sends the new data to anyone who's subscribed to the hub. So it saves on two fronts. First, there's no polling of anything anymore (since you tell the hub when it's updated, and the hub sends out the new data when it has it). Second, the load
      • RSS is pull technology, so the interested server (i.e., Google) needs to keep polling you, asking if you have new content.

        PubSubHubbub is push technology. When you make a change, you submit it to a hub, which knows the interested parties that have asked to hear about your site and distributes the update to them.

        So it is more efficient, since there isn't constant polling, and it is faster, since there is no polling lag.
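
        As a rough sketch of that publisher side (following the PubSubHubbub 0.3 spec, with placeholder URLs), the "submit it to a hub" step is a single form-encoded POST telling the hub which feed changed; the hub then fetches the feed and fans the new entries out to subscribers.

            # Hypothetical publisher ping; hub and feed URLs are placeholders.
            import urllib.parse
            import urllib.request

            def notify_hub(hub_url, topic_url):
                data = urllib.parse.urlencode({"hub.mode": "publish",
                                               "hub.url": topic_url}).encode()
                req = urllib.request.Request(hub_url, data=data)  # POST because data is set
                with urllib.request.urlopen(req) as resp:
                    return resp.status  # the spec has the hub reply 204 No Content on success

            notify_hub("http://hub.example.com/", "http://example.com/feed.atom")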
    • Yes, a good sitemap lists the last-changed date for each page. Google reads the sitemap for each site first. So the above author is right: what PuSH offers is already integrated into sitemaps via the lastmod and changefreq attributes, and no new protocols or hub systems are needed.
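
      For concreteness, a sitemap <url> entry carrying that information looks roughly like the output of this throwaway snippet (the URL and date are invented for the example):

          # Illustrative only: emit one sitemap <url> entry with lastmod/changefreq.
          from xml.sax.saxutils import escape

          def sitemap_entry(loc, lastmod, changefreq="daily"):
              return ("<url>\n"
                      "  <loc>%s</loc>\n"
                      "  <lastmod>%s</lastmod>\n"
                      "  <changefreq>%s</changefreq>\n"
                      "</url>" % (escape(loc), lastmod, changefreq))

          print(sitemap_entry("http://example.com/new-post", "2010-03-04"))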


  • by Rogerborg ( 306625 ) on Thursday March 04, 2010 @01:10PM (#31359672) Homepage
    GOTO Subject
  • by hey ( 83763 )

    This sounds a bit like Twitter. Put your content in one hole and it comes out lots of places.

  • "If a tree falls in the forest and no one is around to hear it, does it make a noise?"

    internet era update:

    "If a webpage is published on the web and no google spider notices it, does it exist?"

    near future update:

    "If a thought enters your mind that is not already indexed by google, is it real?"

  • by 140Mandak262Jamuna ( 970587 ) on Thursday March 04, 2010 @01:34PM (#31360018) Journal
    Funny, I just posted this yesterday on Panda's Thumb [pandasthumb.org]

    As usual I tried to make a tongue-in-cheek remark and ended up chewing my tongue. I meant that Google’s indexer is just that fast. The original posting was made at March 3, 2010 2:09 PM; it was in the index by March 3, 2010 5:08 PM. And that was not even via news.google.com; it was the general web search. Pretty soon Google will tell me that I’m out of milk even before I open the fridge door.

    • Re: (Score:2, Funny)

      Pretty soon Google will tell me that I'm out of milk even before I open the fridge door.

      It also knows what you did last summer. *ominous look towards the laptop in the corner*

    • by Splab ( 574204 )

      Hope it isn't too far away; having my Google Apps account tell me what I need to restock in the fridge (or even the apartment) would be friggin' awesome. Then when cookingwithgoogle.com starts up, just writing the recipe I want could give me a grocery list. Instant win.

      • I'd like to put together a kitchen computer with a camera/barcode reader to keep track of what's in my fridge.

        If food came RFID tagged, it would work even better. Of course RFID & food don't mix too well.

      • Almost all food is barcoded, and barcode readers are cheap. Barcode readers with some local memory could be built, or Wi-Fi-enabled ones that transmit the barcode to a local computer.

        We should be able to build contraptions where you scan every empty carton you throw in the garbage, and it updates the inventory and emails a shopping list, sorted by aisle for your local grocery store, thank you, to your cell phone.

        Yeah, if I can think about it, I am sure someone has already done it. I am not exactly t

        • I seriously thought about this once, and realised that the supermarkets will NOT cooperate.

          Ever noticed how supermarkets are forever changing the location of your favourite product? They want you to walk through the whole store, because that way you are likely to make additional/unplanned purchases. Having a shopping list sorted by store aisle would defeat their nefarious marketing plans.

          I thought of using user-generated data to create the store maps (i.e., scan the barcode when you grab an item off the shel

          • Aisle-wise sorting is just the icing on the cake. Simply having a battery-operated barcode scanner next to the garbage can, so that we can scan what we toss (things that we want to restock), is enough. When you plug the scanner into a smartphone, it dumps the data, and the phone has an app that looks up the UPC on the web and converts it to a real shopping list. That is basically the important functionality. You can jazz it up by making the scanner really small and portable and you can carry it to the stor
      • by D Ninja ( 825055 )

        This is a fantastic idea. I would love to have something like this; when I go to the grocery store, I typically find myself buying the same stinking food again and again (it's tough to have a good imagination when you're in a rush).

        Any Google engineers out there with a penchant for cooking: this would be a great 20% time project.

      • Hope it isn't too far away; having my Google Apps account tell me what I need to restock in the fridge (or even the apartment) would be friggin' awesome. Then when cookingwithgoogle.com starts up, just writing the recipe I want could give me a grocery list. Instant win.

        Some of these services annoy me because I don't want to be a creature of habit in everything I do. I personally want some variety from time to time, and being able to predict individual whims is so far out in the future it's not even sci-fi, it's plain fantasy. Or maybe there is an overall pattern there, something that says routine for 4 weeks, then a 75% chance of a random choice of ingredients from Wed to Fri and 95% on weekends. But if there is, I don't want to know about it and, more importantly, I don't want

  • Keep in mind Google is quickly becoming an all controlling entity.
    I have concerns that this technology could expose users to additional threats.
    Likely I see it as one more way for Google to corner the search market.
    Lastly I ponder the legal implications of a direct tying to a web site's content. What if there is a copyright violation.

    Generally I find this to be a dud tech.
    Long ago we had to publish to search engines then the crawlers came and life was good.
    Again automation is what made things better.
    Diving

    • by dark_15 ( 962590 )

      This was a triumph... I'm making a note here: HUGE SUCCESS!

      (For the uninitiated: read the first letter of each sentence downwards.)

  • I remember when I worked on this back around the turn of the century; it was called GridIR back then: http://www.gir-wg.org/index.html [gir-wg.org] A subscription-based indexing/search/collection engine.
  • Amusingly, since this is based on Atom, the client still has to poll. It just has to poll fewer sources. The connection between the original source and the PubSubHubbub server really is a "push" connection, but the hub-to-client connection is not.

    Also, the PubSubHubbub hub caches and redistributes the feeds, which means the feed operator no longer sees their own clients.

    They don't seem to have addressed the general RSS problem of "server timestamp/ID changed, but content did not". Some RSS feeds get this

    • The connection between the original source and the PubSubHubbub server really is a "push" connection, but the hub-to-client connection is not.

      This isn't right. You can see in section 7.3 of the spec that the hub sends an HTTP POST to each client (subscriber) for each update; there's no polling.

      • by Animats ( 122034 )

        This isn't right. You can see in section 7.3 of the spec that the hub sends an HTTP POST to each client (subscriber) for each update; there's no polling.

        You're right. Which implies that the subscriber has to have a web server. Somebody will probably try a "web server in the browser" thing for browser-type subscribers.

        To some extent, they've re-invented Usenet.
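
        For what it's worth, that web server can be tiny. Here is a hypothetical stand-alone sketch of a subscriber callback (the port and behaviour are assumptions for illustration, but the GET-echoes-hub.challenge and POST-delivers-content shape follows the spec):

            # Minimal PuSH subscriber callback sketch; not production code.
            from http.server import BaseHTTPRequestHandler, HTTPServer
            from urllib.parse import urlparse, parse_qs

            class Callback(BaseHTTPRequestHandler):
                def do_GET(self):
                    # Subscription verification: echo hub.challenge back with a 200.
                    qs = parse_qs(urlparse(self.path).query)
                    self.send_response(200)
                    self.end_headers()
                    self.wfile.write(qs.get("hub.challenge", [""])[0].encode())

                def do_POST(self):
                    # Content delivery (spec section 7.3): the hub POSTs the new Atom entries.
                    length = int(self.headers.get("Content-Length", 0))
                    payload = self.rfile.read(length)
                    print("got %d bytes of pushed feed content" % len(payload))
                    self.send_response(200)  # any 2xx acknowledges receipt
                    self.end_headers()

            HTTPServer(("", 8080), Callback).serve_forever()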

  • I updated my site on 15 February, and today, 4 March, I can still see the old links in Google and none of the new ones. It seems they also need more servers...
  • It would be unreal how fast spammers would exploit this.
  • Seems to me that push publishing is already implemented on the web via services like Ping-o-Matic [pingomatic.com] and such. I can't see why a new push-publishing method would be needed, since the blog ping works elegantly. Obviously, that system is abused by spammers, but Google's solution would suffer from the same problem.
