Changeset 178

Timestamp: 04/20/08 02:25:44
Author: eric.dumin..@gmail.com
Message:

Filters have been renamed to PlainTextExtractors.

Filter is a Rails protected name.

Files:

  • branches/oo_indexer/README.txt

    r107 r178  
    1515Picolena has many advantages: 
    1616 
    17    * it can index .pdf, .doc, .docx, .odt, .xls, .ods, .ppt, .pptx, .odp, .rtf, .html and plain text files will full text search, and offers a very easy way to add new filters to index other filetype. 
     17   * it can index .pdf, .doc, .docx, .odt, .xls, .ods, .ppt, .pptx, .odp, .rtf, .html and plain text files will full text search, and offers a very easy way to add new extractors to index other filetype. 
    1818   * it is free as in free beer and as in free speech 
    1919   * thanks to Ferret, it is very fast 
  • branches/oo_indexer/lib/picolena/templates/app/models/document.rb

    r177 r178  
    5050  end 
    5151   
    52   # Returns true iff some Filter has been defined to convert it to plain text. 
     52  # Returns true iff some PlainTextExtractor has been defined to convert it to plain text. 
    5353  #  Document.new("presentation.pdf").supported? => true 
    5454  #  Document.new("presentation.some_weird_extension").supported? => false 
    5555  def supported? 
    56     Filter.supported_extensions.include?(self.ext_as_sym) 
     56    PlainTextExtractor.supported_extensions.include?(self.ext_as_sym) 
    5757  end 
    5858   
    5959  # Retrieves content as it is *now*. 
    6060  def content 
    61     Filter.extract_content_from(complete_path) 
     61    PlainTextExtractor.extract_content_from(complete_path) 
    6262  end 
    6363   
     
    6565  # Returns content as it was at the time it was indexed. 
    6666  def cached 
    67     IndexReader.new[index_id][:content] 
     67    from_index[:content] 
    6868  end 
    6969   
  • branches/oo_indexer/lib/picolena/templates/app/models/indexer.rb

    r177 r178  
    100100       
    101101      begin  
    102         text = Filter.extract_content_from(complete_path) 
     102        text = PlainTextExtractor.extract_content_from(complete_path) 
    103103        raise "\tempty document #{complete_path}" if text.strip.empty? 
    104104        fields[:content] = text 
  • branches/oo_indexer/lib/picolena/templates/app/models/plain_text_extractor.rb

    r177 r178  
    1 require 'filter_DSL' 
     1require 'plain_text_extractor_DSL' 
    22 
    3 class Filter
    4   include FilterDSL 
    5   @@filters=[] 
     3class PlainTextExtractor
     4  include PlainTextExtractorDSL 
     5  @@extractors=[] 
    66  class<<self  
    7     # Returns every defined filter
     7    # Returns every defined extractor
    88    def all 
    9       @@filters 
     9      @@extractors 
    1010    end 
    1111     
    12     # Add a filter to the filters list 
    13     def add(filter) 
    14       @@filters<<filter
     12    # Add an extractor to the extractors list 
     13    def add(extractor) 
     14      @@extractors<<extractor
    1515    end 
    1616     
    17     # Calls block for each filter
     17    # Calls block for each extractor
    1818    def each(&block) 
    1919      all.each(&block) 
    2020    end 
    2121     
    22     # Returns every required dependency for every defined filter
     22    # Returns every required dependency for every defined extractor
    2323    def dependencies 
    24       @@dependencies||=all.collect{|filter| filter.dependencies}.flatten.compact.uniq.sort 
     24      @@dependencies||=all.collect{|extractor| extractor.dependencies}.flatten.compact.uniq.sort 
    2525    end 
    2626     
    2727    # Returns every supported file extensions 
    2828    def supported_extensions 
    29       @@supported_exts||=all.collect{|filter| filter.exts}.flatten.compact.uniq 
     29      @@supported_exts||=all.collect{|extractor| extractor.exts}.flatten.compact.uniq 
    3030    end 
    3131     
    32     # Finds which filter should be used for a given file, according to its extension 
     32    # Finds which extractor should be used for a given file, according to its extension 
    3333    # Raises if the file is unsupported.  
    3434    def find_by_filename(filename) 
    3535      ext=File.ext_as_sym(filename) 
    36       filter=all.find{|filter| filter.exts.include?(ext)} || raise(ArgumentError, "no convertor for #{filename}") 
    37       filter.source=filename 
    38       filter
     36      extractor=all.find{|extractor| extractor.exts.include?(ext)} || raise(ArgumentError, "no convertor for #{filename}") 
     37      extractor.source=filename 
     38      extractor
    3939    end 
    4040     
    41     # Launches filter on given file and outputs plain text result 
     41    # Launches extractor on given file and outputs plain text result 
    4242    def extract_content_from(source) 
    4343      find_by_filename(source).extract_content 
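
The class-level registry in this file can be illustrated with a small standalone sketch. It is not Picolena's code: `SketchExtractor` and its `ext_as_sym` helper are hypothetical stand-ins (the real `File.ext_as_sym` lives in `core_exts`, which this changeset does not show), and the `:no_extension` fallback is assumed from the symbol used in plain_text.rb. Unlike the original, the sketch recomputes `supported_extensions` on every call instead of memoizing it with `||=`, so extractors registered later are still seen.

```ruby
# Standalone sketch of a class-level extractor registry (hypothetical names).
class SketchExtractor
  @registry = []

  class << self
    # Register an extractor instance.
    def add(extractor)
      @registry << extractor
    end

    # Every extension handled by at least one registered extractor.
    def supported_extensions
      @registry.flat_map(&:exts).compact.uniq
    end

    # Hypothetical stand-in for File.ext_as_sym: "a.PDF" becomes :pdf,
    # a name without extension becomes :no_extension (assumed behaviour).
    def ext_as_sym(filename)
      ext = File.extname(filename).delete_prefix('.').downcase
      ext.empty? ? :no_extension : ext.to_sym
    end

    # First extractor that declares the file's extension, or ArgumentError.
    def find_by_filename(filename)
      ext = ext_as_sym(filename)
      @registry.find { |extractor| extractor.exts.include?(ext) } ||
        raise(ArgumentError, "no convertor for #{filename}")
    end
  end

  attr_reader :exts

  def initialize(*exts)
    @exts = exts
    self.class.add(self)
  end
end

SketchExtractor.new(:pdf)
SketchExtractor.new(:html, :htm)
```

With these two registrations, `SketchExtractor.find_by_filename("index.HTM")` returns the second extractor, and an unknown extension raises `ArgumentError`, mirroring the "Raises if the file is unsupported" comment above.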
  • branches/oo_indexer/lib/picolena/templates/config/initializers/004_load_filters.rb

    r170 r178  
    11require 'core_exts' 
    2 require 'filter_DSL' 
     2require 'plain_text_extractor_DSL' 
    33 
    4 Dir.glob(File.join(RAILS_ROOT,'lib/filters/*.rb')).each{|filter| 
    5   require filter
     4Dir.glob(File.join(RAILS_ROOT,'lib/plain_text_extractors/*.rb')).each{|extractor| 
     5  require extractor
    66} 
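
The glob-and-require loading in this initializer can be tried outside Rails with a throwaway directory. The following is only a sketch: the temporary directory stands in for `RAILS_ROOT`, the two file bodies and the `$loaded_sketches` global are invented for illustration, and the explicit `.sort` is an addition (Ruby versions before 3.0 did not guarantee the order of `Dir.glob` results).

```ruby
require 'tmpdir'
require 'fileutils'

# Records which files were loaded; a hypothetical stand-in for the
# registration side effect of each PlainTextExtractor.new { ... } file.
$loaded_sketches = []

Dir.mktmpdir do |root|
  dir = File.join(root, 'lib', 'plain_text_extractors')
  FileUtils.mkdir_p(dir)

  # Two tiny "extractor" files that only announce themselves.
  File.write(File.join(dir, 'adobe.pdf.rb'), "$loaded_sketches << :adobe_pdf\n")
  File.write(File.join(dir, 'html.rb'), "$loaded_sketches << :html\n")

  # Same idea as the initializer: require every .rb file in the directory.
  Dir.glob(File.join(dir, '*.rb')).sort.each { |extractor| require extractor }
end
```

Each file runs once when it is required, which is what lets a definition file register itself simply by being loaded.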
  • branches/oo_indexer/lib/picolena/templates/lib/plain_text_extractor_DSL.rb

    r177 r178  
    1 # Defines Filters with DSL 
     1# Defines plain text extractors with DSL 
    22# For example, to convert "Microsoft Office Word document" to plain text 
    3 Filter.new { 
     3PlainTextExtractor.new { 
    44#    every :doc, :dot 
    55#    as "application/msword" 
     
    1010#  } 
    1111 
    12 module FilterDSL 
     12module PlainTextExtractorDSL 
    1313  attr_reader :exts, :mime_name, :description, :command, :content_and_file_examples 
    1414   
     
    1616    @content_and_file_examples=[] 
    1717    self.instance_eval(&block) 
    18     Filter.add(self) 
     18    PlainTextExtractor.add(self) 
    1919    MimeType.add(self.exts,self.mime_name) 
    2020  end 
     
    3636  end 
    3737   
    38   #used by rspec to test filters: 
     38  #used by rspec to test extractors: 
    3939  #  which_should_for_example_extract 'in a pdf file', :from => 'basic.pdf' 
    4040  #  or_extract 'some other stuff inside another pdf file', :from => 'yet_another.pdf' 
    4141  # 
    4242  #this spec will pass if 'basic.pdf' and 'yet_another.pdf' are included in an indexed directory, if every dependency is installed, 
    43   #and if plain text output from the filter applied to 'basic.pdf' and 'yet_another.pdf' respectively include 'in a pdf file' and 'some other stuff inside another pdf file'  
     43  #and if plain text output from the extractor applied to 'basic.pdf' and 'yet_another.pdf' respectively include 'in a pdf file' and 'some other stuff inside another pdf file'  
    4444  def which_should_for_example_extract(content, file) 
    4545    @content_and_file_examples << [content,file[:from]] 
     
    6666      command_as_hash_or_string.invert[platform].dup 
    6767    else 
    68       block || raise("No command defined for this filter: #{description}") 
     68      block || raise("No command defined for this extractor: #{description}") 
    6969    end 
    7070    @command<<' 2>/dev/null' if (@command.is_a?(String) && platform==:on_linux && !@command.include?('|')) 
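
The DSL in this file relies on `instance_eval` to run the definition block against the new extractor object, so bare calls such as `every` and `as` become method calls on it. Below is a minimal sketch of that mechanism under stated assumptions: `SketchDSLExtractor` is a hypothetical class, only DSL words that appear in this changeset are implemented, the exact signature of `which_requires` is guessed from its one visible use, and treating `or_extract` as an alias is an assumption.

```ruby
# Minimal instance_eval-based DSL sketch (hypothetical class, not Picolena's).
class SketchDSLExtractor
  attr_reader :exts, :mime_name, :dependencies, :content_and_file_examples

  def initialize(&block)
    @dependencies = []
    @content_and_file_examples = []
    # Evaluate the block with self set to this instance.
    instance_eval(&block)
  end

  # File extensions handled by this extractor.
  def every(*exts)
    @exts = exts
  end

  # MIME type name.
  def as(mime_name)
    @mime_name = mime_name
  end

  # External programs or libraries needed (signature assumed).
  def which_requires(*dependencies)
    @dependencies.concat(dependencies)
  end

  # Records a [content, file] pair for the specs to check later.
  def which_should_for_example_extract(content, file)
    @content_and_file_examples << [content, file[:from]]
  end
  alias or_extract which_should_for_example_extract
end

plain_text = SketchDSLExtractor.new {
  every :txt, :text
  as "application/plain"
  which_requires 'iconv'
  which_should_for_example_extract 'Hello world!', :from => 'hello.rb'
  or_extract 'text inside!', :from => 'crossed.txt'
}
```

After this runs, `plain_text.content_and_file_examples` holds the two pairs that a spec could iterate over.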
  • branches/oo_indexer/lib/picolena/templates/lib/plain_text_extractors/adobe.pdf.rb

    r173 r178  
    55#   Home page: http://www.foolabs.com/xpdf/ 
    66 
    7 Filter.new { 
     7PlainTextExtractor.new { 
    88  every :pdf 
    99  as "application/pdf" 
  • branches/oo_indexer/lib/picolena/templates/lib/plain_text_extractors/html.rb

    r173 r178  
    1 Filter.new { 
     1PlainTextExtractor.new { 
    22  every :html, :htm 
    33  as "text/html" 
  • branches/oo_indexer/lib/picolena/templates/lib/plain_text_extractors/ms.excel.rb

    r173 r178  
    11#Excel 97-2003 
    22 
    3 Filter.new { 
     3PlainTextExtractor.new { 
    44  every :xls 
    55  as "application/excel" 
     
    1212 
    1313require 'zip/zip' 
    14 Filter.new { 
     14PlainTextExtractor.new { 
    1515  every :xlsx 
    1616  as 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' 
  • branches/oo_indexer/lib/picolena/templates/lib/plain_text_extractors/ms.powerpoint.rb

    r173 r178  
    11#Powerpoint 97-2003 
    22 
    3 Filter.new { 
     3PlainTextExtractor.new { 
    44  every :ppt, :pps 
    55  as "application/powerpoint" 
     
    1414 
    1515require 'zip/zip' 
    16 Filter.new { 
     16PlainTextExtractor.new { 
    1717  every :pptx 
    1818  as 'application/vnd.openxmlformats-officedocument.presentationml.presentation' #could that mime BE any longer? 
  • branches/oo_indexer/lib/picolena/templates/lib/plain_text_extractors/ms.rtf.rb

    r173 r178  
    55#   http://www.gnu.org/software/unrtf/unrtf.html 
    66 
    7 Filter.new { 
     7PlainTextExtractor.new { 
    88  every :rtf 
    99  as "application/rtf" 
  • branches/oo_indexer/lib/picolena/templates/lib/plain_text_extractors/ms.word.rb

    r173 r178  
    11#Word 97-2003 
    22 
    3 Filter.new { 
     3PlainTextExtractor.new { 
    44  every :doc, :dot 
    55  as "application/msword" 
     
    1313 
    1414require 'zip/zip' 
    15 Filter.new { 
     15PlainTextExtractor.new { 
    1616  every :docx, :dotx 
    1717  as 'application/vnd.openxmlformats-officedocument.wordprocessingml.document' 
  • branches/oo_indexer/lib/picolena/templates/lib/plain_text_extractors/opendocument.presentation.rb

    r173 r178  
    22 
    33require 'zip/zip' 
    4 Filter.new { 
     4PlainTextExtractor.new { 
    55  every :odp 
    66  as 'application/vnd.oasis.opendocument.presentation' 
  • branches/oo_indexer/lib/picolena/templates/lib/plain_text_extractors/opendocument.spreadsheet.rb

    r173 r178  
    22 
    33require 'zip/zip' 
    4 Filter.new { 
     4PlainTextExtractor.new { 
    55  every :ods 
    66  as 'application/vnd.oasis.opendocument.spreadsheet' 
  • branches/oo_indexer/lib/picolena/templates/lib/plain_text_extractors/opendocument.text.rb

    r173 r178  
    22 
    33require 'zip/zip' 
    4 Filter.new { 
     4PlainTextExtractor.new { 
    55  every :odt 
    66  as 'application/vnd.oasis.opendocument.text' 
  • branches/oo_indexer/lib/picolena/templates/lib/plain_text_extractors/plain_text.rb

    r173 r178  
    1 Filter.new { 
     1PlainTextExtractor.new { 
    22  every :txt, :text, :tex, :for, :cpp, :c, :rb, :ins, :vee, :java, :no_extension 
    33  as "application/plain" 
     
    1515  which_requires 'iconv' 
    1616   
    17   # to check if filter is working with basic plain text files 
     17  # to check if the extractor is working with basic plain text files 
    1818  which_should_for_example_extract 'Hello world!', :from => 'hello.rb' 
    1919  or_extract 'text inside!', :from => 'crossed.txt' 
  • branches/oo_indexer/lib/picolena/templates/lib/tasks/install_dependencies.rake

    r170 r178  
    3030  task :deb_packages do 
    3131    root_privileges_required! 
    32     #TODO: Should load this list from defined Filter
     32    #TODO: Should load this list from defined PlainTextExtractor'
    3333    packages=%w{antiword poppler-utils odt2txt html2text catdoc unrtf}.join(" ") 
    3434    puts "Installing "<<packages 
  • branches/oo_indexer/lib/picolena/templates/spec/helpers/documents_helper_spec.rb

    r170 r178  
    44  it "shouldn't raise if matching not in content field" 
    55 
    6   Filter.supported_extensions.each{|ext| 
     6  PlainTextExtractor.supported_extensions.each{|ext| 
    77    it "should have an icon for .#{ext} filetype" do  
    88      icon_for(ext.to_s).should_not be_nil 
  • branches/oo_indexer/lib/picolena/templates/spec/models/host_indexing_system_spec.rb

    r170 r178  
    22 
    33describe "Host indexing system" do 
    4  Filter.dependencies.each do |dependency| 
     4 PlainTextExtractor.dependencies.each do |dependency| 
    55    it "should have #{dependency} installed" do 
    66       IO.popen("which #{dependency}"){|i| i.read.should_not be_empty} 
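
The spec above shells out to `which` for each dependency. As an alternative sketch (a different technique, not what the spec does), the same check can be made in pure Ruby by scanning the directories in `PATH`. The helper names `executable_in_path?` and `missing_dependencies` are hypothetical.

```ruby
# Pure-Ruby alternative to `which` (hypothetical helpers).

# True if an executable file called +name+ exists in a PATH directory.
def executable_in_path?(name)
  ENV.fetch('PATH', '').split(File::PATH_SEPARATOR).any? do |dir|
    path = File.join(dir, name)
    File.file?(path) && File.executable?(path)
  end
end

# Names from +names+ that could not be found in PATH.
def missing_dependencies(names)
  names.reject { |name| executable_in_path?(name) }
end

# Example call; the result depends on what is installed on the host.
missing = missing_dependencies(%w{antiword unrtf})
puts "Missing: #{missing.join(' ')}" unless missing.empty?
```

This avoids spawning one process per dependency and does not rely on `which` itself being installed.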
  • branches/oo_indexer/lib/picolena/templates/spec/models/plain_text_extractor_spec.rb

    r170 r178  
    11require File.dirname(__FILE__) + '/../spec_helper' 
    22 
    3 describe "Filters" do 
     3describe "PlainTextExtractors" do 
    44  before(:all) do 
    55    IndexReader.ensure_existence 
    66  end   
    77   
    8   Filter.each{|filter| 
    9     filter.exts.each{|ext| 
    10       should_extract= "should be able to extract content from #{filter.description} (.#{ext})" 
    11       content_and_file_examples_for_this_ext=filter.content_and_file_examples.select{|content,file| File.ext_as_sym(file)==ext} 
     8  PlainTextExtractor.each{|extractor| 
     9    extractor.exts.each{|ext| 
     10      should_extract= "should be able to extract content from #{extractor.description} (.#{ext})" 
     11      content_and_file_examples_for_this_ext=extractor.content_and_file_examples.select{|content,file| File.ext_as_sym(file)==ext} 
    1212      unless content_and_file_examples_for_this_ext.empty? then 
    1313        it should_extract do 
     
    2222      else 
    2323        ## It means that the spec for this extension file is "Not yet implemented"! 
    24         ## add this line to the corresponding filter in lib/filters: 
     24        ## add this line to the corresponding extractor in lib/extractors: 
    2525        # which_should_for_example_extract 'some content', :from => 'a file you could add in spec/test_dirs/indexed/' 
    2626        it should_extract