Mutating: The Results

Dec 2nd, 2008 | By | Category: Dear Science Column

I asked you to help me with an experiment as a follow-up to a recent column on mutation.

Here are the results:


[Image: dendrogram of the comments, with red, green, and blue blocks marking the clusters]

Ultimately, I decided not to filter out the noise comments (including my own) that weren’t attempts to copy the original. Almost all of these clustered together in the green block.

The attempts that riffed off the original, like Fnarf’s and Urgutha Forka’s, clustered together as well in the blue blocks.

My original paragraph was slotted in as comment zero, located in the dendrogram as the left-most leaf in the red block. All of the legitimate attempts to copy the paragraph ended up clustered together in the red block.

A few cool mutations emerged. My original:

CCR is short for chemokine receptor. Chemokines and chemokine receptors allow the cells in your immune system to speak to one another; their epic fight against invaders is like a game of Marco Polo. CCR5 is the chemokine receptor found on macrophages—the gobbling-up cells at the front line of your immune system.

Luckier’s Comment #10:

CCR is short for chemokine receptor. Chemokines and chemokine receptors allow the cells in your immune system to speak to one another; their epic fight against invaders is like a game of Marco Polo. CCR5 is a the chemokine receptor found on macrphages–the gobbling-up cells at the front line of your immune system.

Like most mutations during the copying of DNA, the differences here (a stray “a”, a dropped “o” in “macrophages”, a shorter dash) didn’t really change the meaning, only a few small details of how the text was written or punctuated.

See any others?
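
If you would rather let the computer do the squinting, here is a minimal sketch (not part of the script that produced the figure) that uses difflib from Python's standard library to list where a copy departs from the original. The mutations helper and the shortened excerpts are made up for illustration; the real comparison would use the full paragraph and each comment's text.

```python
import difflib

# Shortened excerpts of the original and of comment #10.
# \u2014 is the long dash in the original, \u2013 the shorter dash in the copy.
original = ("CCR5 is the chemokine receptor found on macrophages"
            "\u2014the gobbling-up cells")
copy_10 = ("CCR5 is a the chemokine receptor found on macrphages"
           "\u2013the gobbling-up cells")


def mutations(a, b):
    """Return (operation, text_in_a, text_in_b) for every stretch where a and b differ."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(op, a[i1:i2], b[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes()
            if op != 'equal']


# Each entry is an insertion, deletion, or replacement, much like a point mutation.
for op, old, new in mutations(original, copy_10):
    print(op, repr(old), repr(new))
```

Run over every comment in the red block, something like this would list each copying error instead of leaving them to be found by eye.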


Here’s the Python code it took to make this output:

from HTMLParser import HTMLParser
import sys
import os.path
import copy
import matplotlib
matplotlib.use('Agg')  # render to a file; no display needed
import matplotlib.pyplot as plt
from hcluster import pdist, linkage, dendrogram


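class Spider(HTMLParser):
    """Collect the text chunks of every comment <div> on the saved page."""
    def __init__(self, file, inlist):
        HTMLParser.__init__(self)
        self.inComment = False     # True while inside a comment <div>
        self.tempComment = []      # text chunks of the comment being read
        self.commentList = inlist  # caller's list, filled in place
        self.feed(file.read())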
        

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and attrs:
            for att in attrs:
                if att[0] == "class" and att[1] == "commentText clearfix":
                    self.inComment = True
                        
    
    def handle_data(self, data):
        if self.inComment:
            self.tempComment.append(data)
    
    def handle_endtag(self, tag):
        if tag == 'div' and self.inComment:
            self.commentList.append(copy.deepcopy(self.tempComment))
            self.tempComment = []
            self.inComment = False
    
    def getCommentList(self):
        return self.commentList
        


filename = sys.argv[1]

if os.path.isfile(filename):
    myfile = open(filename,'r')
    rawCommentList = []
    # Let's initialize the CommentList with my original seed text
    CommentList = [{'text': "								CCR is short for chemokine receptor. Chemokines and chemokine receptors allow the cells in your immune system to speak to one another; their epic fight against invaders is like a game of Marco Polo. CCR5 is the chemokine receptor found on macrophages--the gobbling-up cells at the front line of your immune system.", 'num': '0'}]
    tempVector = []
    # Encode the seed text as a vector of character codes, skipping tabs
    for c in CommentList[0]['text']:
        if not c == '\t':
            tempVector.append(ord(c))
    
    CommentList[0]['vector'] = copy.deepcopy(tempVector)
    Spider(myfile, rawCommentList)
    myfile.close()
    
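    for rawComment in rawCommentList:
        tempCommentDict = {}
        # Each raw comment is a list of text chunks; on this page layout,
        # chunk 2 holds the comment number and chunk 6 the comment body.
        tempCommentDict['num'] = rawComment[2]
        tempCommentDict['text'] = rawComment[6]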
        tempVector = []
        for c in rawComment[6]:
            if not c == '\t':
                tempVector.append(ord(c))
        
        tempCommentDict['vector'] = copy.deepcopy(tempVector)
        
        CommentList.append(copy.deepcopy(tempCommentDict))
    

    vectorList = []
    maxVectorLen = 0
    for Comment in CommentList:
        vectorList.append(Comment['vector'])
        if len(Comment['vector']) > maxVectorLen:
            maxVectorLen = len(Comment['vector'])
        
    # Pad shorter vectors with zeros so every row has the same length
    for index, v in enumerate(vectorList):
        if len(v) < maxVectorLen:
            paddingLen = maxVectorLen - len(v)
            vectorList[index] = v + [0]*paddingLen
    
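    dm = pdist(vectorList)  # pairwise distances between comment vectors
    lm = linkage(dm)        # hierarchical clustering of those distances
    dendrogram(lm)          # draw the tree on the current matplotlib figure

    plt.savefig('plot.png', dpi=200)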
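
To see what the distance step is measuring, here is a small standalone sketch. It assumes pdist is computing ordinary Euclidean distance between the padded character-code vectors. The helper names (to_vector, pad, euclidean) and the shortened test sentences are hypothetical, chosen only for illustration.

```python
import math


def to_vector(text):
    """Encode text as a list of character codes, skipping tabs as the script does."""
    return [ord(c) for c in text if c != '\t']


def pad(vectors):
    """Right-pad every vector with zeros to the length of the longest one."""
    width = max(len(v) for v in vectors)
    return [v + [0] * (width - len(v)) for v in vectors]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


original = "CCR5 is the chemokine receptor found on macrophages"
substituted = "CCR5 is the chemokine receptor found on macrophagez"  # last letter changed
inserted = "CCR5 is a the chemokine receptor found on macrophages"   # two characters added early

vectors = pad([to_vector(t) for t in (original, substituted, inserted)])
d_sub = euclidean(vectors[0], vectors[1])
d_ins = euclidean(vectors[0], vectors[2])
print(d_sub, d_ins)
```

Because characters are compared position by position, an insertion near the start shifts everything after it out of alignment, so it counts for far more than a typo at the end. An edit-distance measure would likely treat the two kinds of mutation more evenly and may be worth trying.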