Waiting for Singularity

Thoughts on science, tech and other preoccupations

Applying TDD in Writing Bioinformatics Software

Ruby for Biologists – Part 1

DNA sequence analysis is a basic task in molecular biology and is a good starting point for software development using Test Driven Development (TDD). Given the sequence of a common sequencing primer, “GCAATTAACCCTCACTAAAGG”, I wrote a program to calculate the percentage of each nucleotide. For this, I chose Ruby, a dynamic scripting language and Rspec, a testing tool for the Ruby language that uses simple language to describe expectations on how the code should work.

Starting with a simple example that expresses some basic behaviour, for example ‘it counts the frequency of nucleotides in a string’ and following TDD best practices, I wrote a failing test then wrote just enough Ruby code to make it pass. I then improved the code, added another expected behavior and repeated the cycle. In TDD circles, this is known as Red-Green-Refactor. Primer sequences are typically 18-24 nucleotides, but for genes that are several thousand nucleotides in length, a script that automates the task of calculating DNA statistics can be quite powerful.

Specs

    # dna_stats_spec.rb

    require 'rspec'

    require_relative 'dna_stats'

    describe 'Dna statistics' do
      it 'counts the frequency of nucleotides in a string' do
        expect(Dna.new('agccttaa').count).to eql([["a", 3], \
        ["c", 2], ["g", 1], ["t", 2]])
      end

      it 'counts the frequency of nucleotides in a string' do
        expect(Dna.new('nggaatncc').count).to eql([["a", 2], \
        ["c", 2], ["g", 2], ["n", 2], ["t", 1]])
      end

      it 'analyzes the dna' do
        expect(Dna.new('aaggcccttt').analyze).to eql([["a", 0.2], \
        ["c", 0.3], ["g", 0.2], ["t", 0.3]])
      end

      it 'displays the dna statistics' do
        expect(Dna.new('aaggcccttt').display).to eql("DNA statistics \
        (10 residue sequence) A: 20.0% C: 30.0% G: 20.0% T: 30.0%")
      end
    end

Solution in Ruby

    # dna_stats.rb

    class Dna

      attr_reader :sequence, :sorted_array, :statistics, :string

      def initialize(sequence)

        @sequence = sequence
        @sorted_array = []
        @statistics = []
        @string = ''
      end


      def count
        hash = {}
        sequence.each_char do | char |
          hash[char] ? hash[char] << char : hash[char] = [char]
        end

        hash.each do | key, value |
          hash[key] = value.size.to_i
        end
        #binding.pry
        @sorted_array = hash.sort
        return @sorted_array

      end

      def analyze
        @total_bases = 0
        count
        @sorted_array.each do | item |
          @total_bases += item[1]
        end
        string << "DNA statistics (#{@total_bases} residue sequence)"
        @sorted_array.each do | item |
          @statistics << [item[0], (item[1].to_f / @total_bases)]
        end
        return @statistics
      end

      def display
        analyze
        @statistics.each do | item |
          string << " #{item[0].upcase}: #{(item[1]*100).round(1)}%"
        end
        return string
      end
    end

    sequence = "GCAATTAACCCTCACTAAAGG"
    puts "DNA sequence: " + sequence
    seq = Dna.new(sequence)
    puts seq.display

Output

    ruby-bio-suite $ ruby dna_statistics.rb

    DNA sequence: GCAATTAACCCTCACTAAAGG
    DNA statistics (21 residue sequence) A: 38.1% C: 28.6% G: 14.3% T: 19.0%

Future implementations of this code will include methods for determining melting temperature, Tm and detecting hairpin loops.

Got 15 minutes? Give Ruby a try now: Try Ruby. Learn more about Ruby and Rspec here and here.