Since I started using Google Reader, probably almost four years ago, I have been subscribing to Google’s blogs; the number I subscribe to is around 20. I am not certain when I began to notice Google’s cross-posts, but I can say this: the cross-posting has become annoying.

Firstly, I don’t object to cross-posting. In fact, I did it twice on my old blogs, but I stopped because I don’t like copy-and-pasting the same thing; instead, I write a short post that links to the post on the other blog. I don’t mind other people doing it, because it’s your own content and your own blogs, and you can do whatever you want with them. Moreover, Google has never discouraged anyone from cross-posting, since they do it themselves and their search engine is smart. They even give you an SEO mechanism: you can add an HTML tag, the rel="canonical" link element, to tell Googlebot which copy is the original.

The interesting thing is that Google is the only company I have noticed doing intense cross-posting. To show you some statistics on why I say intense, I wrote a Python script to gather the numbers. Mind you, the results do not reflect the whole picture; the real numbers might be higher. The code grabs each blog’s FeedBurner or Blogspot feed and counts the duplicates. Each feed covers a different date span but a fixed number of posts, usually 20. Some blogs post often and some don’t, so some duplicates won’t be detected. Also, duplicates are decided by matching post titles. Google maintains a directory of its blogs, which is where the code gets its feed URLs.

I know I should use the Blogger API to set up a date range; I am just being lazy. It’s not precise, but it’s enough.
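
If you are less lazy than me, Blogspot feeds should accept the Blogger Data API query parameters for a date window, so you could ask for a fixed range instead of the default latest posts. Here is a rough sketch, separate from the script at the end of this post; the published-min/published-max parameters are taken from the Blogger API documentation, and the blog URL and dates are only examples:

# Rough sketch: fetch a fixed date window from a Blogspot feed with the
# Blogger Data API query parameters instead of the default latest posts.
# Parameter names are from the Blogger API docs; URL and dates are examples.

from urllib import urlencode
from urllib2 import urlopen

import feedparser as fp

FEED = 'http://googleblog.blogspot.com/feeds/posts/default'

params = urlencode({
  'published-min': '2011-01-01T00:00:00',
  'published-max': '2011-02-01T00:00:00',
  'max-results': 100,
  })

feed = fp.parse(urlopen('%s?%s' % (FEED, params)).read())
for entry in feed.entries:
  print entry.get('published', entry.updated), entry.title.encode('utf-8')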

http://i.imgur.com/bY3X7b.jpg

Now, here is the first result, shown in the first image on the right[1]. It’s a very long one. Each group starts with a number, which is how many blogs published a post with that same title. The number is followed by the post title, and the lines below it show each copy’s published or updated date and the name of the blog it appeared on.

As you can see, most groups are one original plus one duplicate. In total, 129 original posts have 162 duplicates among the 2,427 posts of 106 blogs. Roughly speaking, there is a 6.67% chance that the post you are reading is a duplicate; in every 100 posts from Google, nearly 7 are duplicates. The percentage could be higher, as I explained above.

http://farm6.static.flickr.com/5253/5393876525_3aa3f3a39a.jpg

The second screenshot shows the original posts with two or more duplicates; there are only 20 of them. That means 109 original posts have exactly one duplicate, so 84.5% of the cross-posted originals are duplicated just once.
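
In case you want to double-check the arithmetic, it is just the counts above plugged into two lines of Python (the variable names are mine):

# Sanity check on the percentages, using the counts reported above
originals = 129      # posts that were cross-posted at least once
duplicates = 162     # extra copies of those originals
total_posts = 2427   # all posts fetched from the 106 blogs
multi = 20           # originals with two or more duplicates

print '%.2f%% of all posts are duplicates' % (100.0 * duplicates / total_posts)
print '%.1f%% of cross-posted originals have exactly one duplicate' % (
    100.0 * (originals - multi) / originals)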

I think that’s all for the numbers.

I believe it’s better not to cross-post so often. The authors could write down why they wanted to cross-post in the first place, turn those reasons into an original post of their own, and link to the source post. For example, many posts are cross-posted to regional Google blogs; those blogs could instead write about local activity, feedback from last year, or a different point of view based on the type of the blog.

Cross-posting is quick and simple, but that also makes a cross-post very cheap. Even a short update of a few lines is better than duplicating a whole post, in my opinion.

I have thought about a filter for Google Reader that would group posts or hide duplicates, but it doesn’t sound quite right to me. It’s like a spam filter: it doesn’t solve the problem, it only covers it up.

As for the Python code:

#!/usr/bin/env python

import os
import re
import sys

from contextlib import closing, nested
from urllib2 import urlopen

import feedparser as fp


BLOGLIST = 'http://www.google.com/press/blog-directory.html'
BLOGLISTHTML = '/tmp/google-blogs-directory.html'
BLOGLISTLIST = '/tmp/google-blogs.lst'
TMPDIR = '/tmp'


# Get the Blog feeds list
if os.path.exists(BLOGLISTLIST):
  with open(BLOGLISTLIST) as f:
    blogs = f.read().split('\n')
else:
  if os.path.exists(BLOGLISTHTML):
    with open(BLOGLISTHTML) as f:
      bloghtml = f.read()
  else:
    with nested(closing(urlopen(BLOGLIST)), open(BLOGLISTHTML, 'w')) as (f, fhtml):
      bloghtml = f.read()
      fhtml.write(bloghtml)
  blogs = []
  for feed in re.finditer('<a href="(.*?)">Subscribe</a>', bloghtml):
    blogs.append(feed.group(1))
  with open(BLOGLISTLIST, 'w') as f:
    f.write('\n'.join(blogs))

# Remove the duplicate listing of the Open Source Blog. Who knows why Google lists it twice?
# Showing their <3 to FOSS?
blogs = list(set(blogs))

total_entries = 0
entries = {}
# Get each feed
for feedurl in blogs:
  # Feed parsing takes a while, so print a '#' as a progress marker
  sys.stdout.write('#')
  sys.stdout.flush()
  feedfile = '%s/%s' % (TMPDIR, feedurl.replace('/', '-'))
  if os.path.exists(feedfile):
    with open(feedfile) as ffile:
      feedraw = ffile.read()
  else:
    print 'Downloading %s...' % feedurl
    with nested(closing(urlopen(feedurl)), open(feedfile, 'w')) as (furl, ffile):
      feedraw = furl.read()
      ffile.write(feedraw)
  # Get each entry's title, date, url
  feed = fp.parse(feedraw)
  total_entries += len(feed.entries)
  for entry in feed.entries:
    # Five posts don't have a title... weird...
    if not entry.title:
      continue
    data = (feed.feed.title, entry.get('published', entry.updated),
        entry.get('feedburner_origlink', entry.link))
    if entry.title in entries:
      # The same title from the same feed doesn't count as a duplicate
      for post in entries[entry.title]:
        if feed.feed.title == post[0]:
          break
      else:
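        # for/else: the loop did not break, so this blog has not posted this
        # title before; record this copy as another occurrence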
        entries[entry.title].append(data)
    else:
      entries[entry.title] = [data]

# Only keep titles that appear in more than one blog
new_entries = {}
for title, posts in entries.items():
  if len(posts) > 1:
    new_entries[title] = posts
entries = new_entries

# print out the results
print
for title, posts in entries.items():
  print ('\033[36m%d\033[0m \033[35m%s\033[0m' % (len(posts), title)).encode('utf-8')
  for post in posts:
    print ('  %-29s \033[32m%s\033[0m' % (post[1], post[0])).encode('utf-8')
    # print '   ', post[2] # Post link

# Each title keeps one original; the rest of its copies count as duplicates
dup_entries = sum(len(posts) for posts in entries.values())
print
print '%d duplicates (in %d titles) / %d posts of %d blogs' % (
    dup_entries - len(entries), len(entries), total_entries, len(blogs))

[1] I am not giving you the text version of the results because I don’t want them to be indexed by search engines.