Changeset 322
Timestamp: 05/05/08 03:22:00 (7 months ago)
Files:
- trunk/lib/picolena/templates/app/models/indexer.rb (modified) (7 diffs)
- trunk/lib/picolena/templates/lib/core_exts.rb (modified) (4 diffs)
trunk/lib/picolena/templates/app/models/indexer.rb
--- r321
+++ r322
@@ -1,2 +1,6 @@
+# Indexer is used to index (duh!) documents contained in IndexedDirectories
+# It can create, update, delete and prune the index, and take care that only
+# one IndexWriter exists at any given time, even when used in a multi-threaded
+# way.
 class Indexer
   # This regexp defines which files should *not* be indexed.
@@ -8,4 +12,9 @@
 
   class << self
+    # Finds every document included in IndexedDirectories, parses them with
+    # PlainTextExtractor and adds them to the index.
+    #
+    # Updates the index unless remove_first parameter is set to true, in which
+    # case it removes the index first before re-creating it.
     def index_every_directory(remove_first=false)
       @@do_not_disturb_while_indexing=true
@@ -26,4 +35,7 @@
     end
 
+    # Indexes a given directory, using @@threads_number threads.
+    # To do so, it retrieves a list of every included document, cuts it in
+    # @@threads_number chunks, and create a new indexing thread for every chunk.
     def index_directory_with_multithreads(dir)
       log :debug => "Indexing #{dir}, #{@@threads_number} threads"
@@ -49,4 +61,11 @@
     end
 
+    # Retrieves content and language from a given document, and adds it to the index.
+    # Since Document#probably_unique_id is used as index :key, no document will be added
+    # twice to the index, and the old document will just get updated.
+    #
+    # If for some reason (no content found or no defined PlainTextExtractor), content cannot
+    # be found, some basic information about the document (mtime, filename, complete_path)
+    # gets indexed anyway.
     def add_or_update_file(complete_path)
       document = Document.default_fields_for(complete_path)
@@ -72,8 +91,6 @@
     def close
       @@index.close rescue nil
-      # Ferret will SEGFAULT otherwise.
       @@index = nil
     end
-
 
     # Checks for indexed files that are missing from filesytem
@@ -96,4 +113,5 @@
     end
 
+    # Creates the index unless it already exists.
     def ensure_index_existence
       index_every_directory(:remove_first) unless index_exists? or RAILS_ENV=="production"
@@ -105,4 +123,6 @@
     end
 
+    # Returns the time at which the index was last created/updated.
+    # Returns "none" if it doesn't exist.
     def last_update
       Time._load(index_time_dbm_file['last']) rescue "none"

trunk/lib/picolena/templates/lib/core_exts.rb
--- r310
+++ r322
@@ -7,4 +7,7 @@
 
 module Enumerable
+  # Similar to Enumerable#each, but creates a new thread for each element.
+  # Used for the indexer to make it multi-threaded.
+  # It ensures that threads are joined together before returning.
   def each_with_thread(&block)
     tds=self.collect{|elem|
@@ -42,8 +45,21 @@
 
 class File
+  # Returns the filetype of filename as a symbol.
+  # Returns :no_extension unless an extension is found
+  # >> File.ext_as_sym("test.pdf")
+  # => :pdf
+  # >> File.ext_as_sym("test.tar.gz")
+  # => :gz
+  # >> File.ext_as_sym("test")
+  # => :no_extension
   def self.ext_as_sym(filename)
     File.extname(filename).sub(/^\./,'').downcase.to_sym rescue :no_extension
   end
 
+  # Returns a probable encoding for a given plain text file
+  # If source is a html file, it parses for metadata to retrieve encoding,
+  # and uses file -i otherwise.
+  # Returns iso-8859-15 instead of iso-8859-1, to be sure € char can be
+  # encoded
   def self.encoding(source)
     parse_for_charset="grep -io charset=[a-z0-9\\-]* | sed 's/charset=//i'"
@@ -64,4 +80,6 @@
   end
 
+  # Returns the content of a file and removes it after.
+  # Could be used to read temporary output file written by a PlainTextExtractor.
   def self.read_and_remove(filename)
     content=read(filename)
@@ -69,5 +87,8 @@
     content
   end
 
+  # Returns nil unless filename is a plain text file.
+  # It requires file command.
+  # NOTE: What to use for Win32?
   def self.plain_text?(filename)
     %x{file -i "#{filename}"} =~ /: text\//
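
The comments added to index_directory_with_multithreads and each_with_thread describe a chunk-then-thread pattern, but both method bodies are elided in this changeset. The following is only a minimal sketch of that pattern, not the committed code: in_chunks and each_in_thread are hypothetical helper names, the file names are made up, and a Mutex-guarded array stands in for the real index.

```ruby
# Hypothetical helper (not from Picolena): split +items+ into at most
# +n+ chunks of roughly equal size.
def in_chunks(items, n)
  return [] if items.empty?
  size = (items.size / n.to_f).ceil
  items.each_slice(size).to_a
end

# Hypothetical helper (not from Picolena): run +block+ on every element in
# its own thread, and join all threads before returning.
def each_in_thread(items, &block)
  threads = items.map { |item| Thread.new(item, &block) }
  threads.each(&:join)
  items
end

# Stand-ins for the document list and the index (assumptions for the demo).
files   = (1..10).map { |i| "doc#{i}.txt" }
indexed = []
lock    = Mutex.new

# One thread per chunk; each thread walks its own chunk sequentially.
each_in_thread(in_chunks(files, 3)) do |chunk|
  chunk.each do |path|
    lock.synchronize { indexed << path }
  end
end
```

Joining before returning means the caller can rely on every chunk having been processed once the call completes, which matches what the each_with_thread comment promises.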

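
The examples in the File.ext_as_sym comment rely on the trailing rescue to produce :no_extension. That held on Ruby 1.8, where converting an empty string to a symbol raised an error; on Ruby 1.9 and later "".to_sym returns :"" instead, so the fallback would not fire for a name without an extension. A sketch of an explicit variant, assuming a modern Ruby; ext_as_sym_explicit is a hypothetical name and is not part of this changeset:

```ruby
# Hypothetical variant (not the committed method): returns the extension of
# +filename+ as a lowercase symbol, or :no_extension when there is none.
def ext_as_sym_explicit(filename)
  ext = File.extname(filename).sub(/^\./, '').downcase
  ext.empty? ? :no_extension : ext.to_sym
end

ext_as_sym_explicit("test.pdf")     # => :pdf
ext_as_sym_explicit("test.tar.gz")  # => :gz
ext_as_sym_explicit("test")         # => :no_extension
```

Checking for the empty string directly avoids depending on an exception that newer interpreters no longer raise.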