Changeset 322

Timestamp:
05/05/08 03:22:00 (7 months ago)
Author:
eric.dumin..@gmail.com
Message:

More documentation for core_exts and indexer

Files:

  • trunk/lib/picolena/templates/app/models/indexer.rb

    r321 r322  
     1# Indexer is used to index (duh!) documents contained in IndexedDirectories. 
     2# It can create, update, delete and prune the index, and takes care that only 
     3# one IndexWriter exists at any given time, even when used in a multi-threaded 
     4# way. 
    15class Indexer 
    26  # This regexp defines which files should *not* be indexed. 
     
    812 
    913  class << self 
     14    # Finds every document included in IndexedDirectories, parses them with 
     15    # PlainTextExtractor and adds them to the index. 
     16    # 
     17    # Updates the index unless remove_first parameter is set to true, in which 
     18    # case it removes the index first before re-creating it. 
    1019    def index_every_directory(remove_first=false) 
    1120      @@do_not_disturb_while_indexing=true 
     
    2635    end 
    2736 
     37    # Indexes a given directory, using @@threads_number threads. 
     38    # To do so, it retrieves a list of every included document, cuts it into 
     39    # @@threads_number chunks, and creates a new indexing thread for every chunk. 
    2840    def index_directory_with_multithreads(dir) 
    2941      log :debug => "Indexing #{dir}, #{@@threads_number} threads" 
     
    4961    end 
    5062 
     63    # Retrieves content and language from a given document, and adds it to the index. 
     64    # Since Document#probably_unique_id is used as index :key, no document will be added 
     65    # twice to the index, and the old document will just get updated. 
     66    # 
     67    # If for some reason (no content found or no defined PlainTextExtractor), content cannot 
     68    # be found, some basic information about the document (mtime, filename, complete_path) 
     69    # gets indexed anyway. 
    5170    def add_or_update_file(complete_path) 
    5271      document = Document.default_fields_for(complete_path) 
     
    7291    def close 
    7392      @@index.close rescue nil 
    74       # Ferret will SEGFAULT otherwise. 
    7593      @@index = nil 
    7694    end 
    77      
    7895     
    7996    # Checks for indexed files that are missing from filesystem 
     
    96113    end 
    97114 
     115    # Creates the index unless it already exists. 
    98116    def ensure_index_existence 
    99117      index_every_directory(:remove_first) unless index_exists? or RAILS_ENV=="production" 
     
    105123    end 
    106124 
     125    # Returns the time at which the index was last created/updated. 
     126    # Returns "none" if it doesn't exist. 
    107127    def last_update 
    108128      Time._load(index_time_dbm_file['last']) rescue "none" 
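
As a rough illustration of the chunk-and-thread approach described in the comment on index_directory_with_multithreads, here is a minimal sketch. It is not the code from this changeset: split_in_chunks and process_with_threads are hypothetical names, the Queue-based result collection is an assumption, and the block stands in for whatever per-document work add_or_update_file does.

```ruby
require 'thread' # provides Queue on older Rubies; harmless on newer ones

# Hypothetical helper: splits +items+ into at most +n+ chunks of similar size.
def split_in_chunks(items, n)
  return [] if items.empty?
  size = (items.size / n.to_f).ceil
  items.each_slice(size).to_a
end

# Hypothetical stand-in for the indexer: starts one thread per chunk,
# calls the given block for every item, and joins all threads before
# returning the collected results (in no particular order).
def process_with_threads(items, threads_number, &block)
  results = Queue.new # thread-safe collector
  threads = split_in_chunks(items, threads_number).map do |chunk|
    Thread.new(chunk) do |files|
      files.each { |file| results << block.call(file) }
    end
  end
  threads.each(&:join)
  Array.new(results.size) { results.pop }
end
```

Joining every thread before returning mirrors what the each_with_thread comment in core_exts.rb promises. The order of the results depends on thread scheduling, so callers should not rely on it.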
  • trunk/lib/picolena/templates/lib/core_exts.rb

    r310 r322  
    77 
    88module Enumerable 
     9  # Similar to Enumerable#each, but creates a new thread for each element. 
     10  # Used for the indexer to make it multi-threaded. 
     11  # It ensures that threads are joined together before returning. 
    912  def each_with_thread(&block) 
    1013    tds=self.collect{|elem| 
     
    4245 
    4346class File 
     47  # Returns the filetype of filename as a symbol. 
     48  # Returns :no_extension unless an extension is found 
     49  #  >> File.ext_as_sym("test.pdf") 
     50  #  => :pdf 
     51  #  >> File.ext_as_sym("test.tar.gz") 
     52  #  => :gz 
     53  #  >> File.ext_as_sym("test") 
     54  #  => :no_extension 
    4455  def self.ext_as_sym(filename) 
    4556    File.extname(filename).sub(/^\./,'').downcase.to_sym rescue :no_extension 
    4657  end 
    4758 
     59  # Returns a probable encoding for a given plain text file. 
     60  # If source is an HTML file, it parses for metadata to retrieve encoding, 
     61  # and uses file -i otherwise. 
     62  # Returns iso-8859-15 instead of iso-8859-1, so that the € character can be 
     63  # encoded. 
    4864  def self.encoding(source) 
    4965    parse_for_charset="grep -io charset=[a-z0-9\\-]* | sed 's/charset=//i'" 
     
    6480  end 
    6581 
     82  # Returns the content of a file and removes it afterwards. 
     83  # Could be used to read temporary output file written by a PlainTextExtractor. 
    6684  def self.read_and_remove(filename) 
    6785    content=read(filename) 
     
    6987    content 
    7088  end 
    71    
     89  
     90  # Returns nil unless filename is a plain text file. 
     91  # It requires file command. 
     92  # NOTE: What to use for Win32? 
    7293  def self.plain_text?(filename) 
    7394    %x{file -i "#{filename}"} =~ /: text\//
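
The documented behaviour of File.ext_as_sym for a filename without an extension depends on the trailing rescue: as far as I know, "".to_sym raised an ArgumentError on Ruby 1.8 but returns an empty symbol on later versions. The sketch below makes the empty case explicit instead. The name ext_as_sym_explicit is hypothetical and not part of this changeset.

```ruby
# Hypothetical variant of File.ext_as_sym that does not rely on
# String#to_sym raising for an empty string.
def ext_as_sym_explicit(filename)
  # File.extname returns "" when there is no extension.
  ext = File.extname(filename.to_s).sub(/^\./, '').downcase
  ext.empty? ? :no_extension : ext.to_sym
end
```

With this variant, the three examples from the comment ("test.pdf", "test.tar.gz", "test") should give :pdf, :gz and :no_extension regardless of how the interpreter treats empty symbols.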