/ /
raxes : review & rework of searx
Drag image to reposition
raxes : review & rework of searx
Last update
Jan 12, 2021
Time spent (hours)
searx is privacy-respecting metasearch engine. It is search engine that does not have its own index, but rely on other search engines to deliver its results. It rely on Flask, requests and lxml.
Here is sloccount output:
SLOC Directory SLOC-by-Language (Sorted) 14470 searx python=9696,javascript=3703,xml=1071 3695 utils sh=3099,python=596 1512 tests python=1512 210 top_dir sh=142,python=68 107 dockerfiles sh=107 107 docs python=107 18 examples python=18 Totals grouped by language (dominant language first): python: 11997 (59.63%) javascript: 3703 (18.41%) sh: 3348 (16.64%) xml: 1071 (5.32%) Total Physical Source Lines of Code (SLOC) = 20,119 Development Effort Estimate, Person-Years (Person-Months) = 4.68 (56.10) (Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05)) Schedule Estimate, Years (Months) = 0.96 (11.55) (Basic COCOMO model, Months = 2.5 * (person-months**0.38)) Estimated Average Number of Developers (Effort/Schedule) = 4.86 Total Estimated Cost to Develop = $ 631,580 (average salary = $56,286/year, overhead = 2.40).
Here is the first lines of git summary:
project : searx repo age : 7 years active : 1091 days commits : 3805 files : 709 authors : 1200 Adam Tauber 31.5% 453 Markus Heiser 11.9% 422 asciimoo 11.1% 336 Alexandre Flament 8.8% 202 Thomas Pointhuber 5.3% 200 Noémi Ványi 5.3% 165 Cqoicebordel 4.3% 77 Dalf 2.0% 54 Marc Abonce Seguin 1.4% 52 Noemi Vanyi 1.4% 44 a01200356 1.2% 42 dalf 1.1% 39 marc 1.0%
Here is the dependencies:
babel==2.9.0 certifi==2020.12.05 flask==1.1.2 flask-babel==2.0.0 idna==2.10 jinja2==2.11.2 lxml==4.6.2 pygments==2.1.3 pyopenssl==20.0.1 python-dateutil==2.8.1 pyyaml==5.3.1 requests[socks]==2.25.1
According to the Makefile, the entry point of the web application, is ./searx/webapp.py. The main problem with searx is that it is not easy to use it as library.
Looking up @app.route yield only 13 routes:
webapp.py 549:@app.route('/', methods=['GET', 'POST']) 568:@app.route('/search', methods=['GET', 'POST']) 753:@app.route('/about', methods=['GET']) 761:@app.route('/autocompleter', methods=['GET', 'POST']) 812:@app.route('/preferences', methods=['GET', 'POST']) 894:@app.route('/image_proxy', methods=['GET']) 942:@app.route('/stats', methods=['GET']) 952:@app.route('/stats/errors', methods=['GET']) 980:@app.route('/robots.txt', methods=['GET']) 991:@app.route('/opensearch.xml', methods=['GET']) 1014:@app.route('/favicon.ico') 1025:@app.route('/clear_cookies') 1033:@app.route('/config') 1082:@app.route('/translations.js')
Let's look through each from least interesting to the most clever, that is approximately a bottom-up read.
Here is the full content of that route:
@app.route('/translations.js') def js_translations(): return render( 'translations.js.tpl', override_theme='__common__', ), {'Content-Type': 'text/javascript; charset=UTF-8'}
There is two small issues with this code:
It is easier to debug code when return is followed by a simple variable e.g. return foobar. Here things are made worse, because the returned value span multiple lines.
the return is multiple line statement. That requires to zigzag the code. Instead of reading top-down left-right you need to zigzag and reconfigure your brain to also do read down-top (and even right-left in more problematic cases).
The use of underscores _ in variable names is odd. In this case, it is a way to avoid to think and overload the reader with useless and poorly named variables names. I do that a lot with Scheme, but with Scheme you can use more readable and better looking characters for naming variables. underscore is difficult to read, and sometime it disappears because of poor resolution or image scaling. The following code:
_engines = [] for name, engine in engines.items(): something = process(name, engine) _engines.append(something)
Can be reworked into a list comprehension:
engines = [process(name, engine) for (name, engine) in engines.items()]
That much more obvious that one can use a list comprehension instead of the following code:
_plugins = [] for _ in plugins: _plugins.append({'name': _.name, 'enabled': _.default_on})
Mind the use of _ as variable name. When used as variable name placeholder, _ should not be accessed. It would be trivial to replace _ with item and bundle it inside a list comprehension.
That is a complex one-liner:
@app.route('/favicon.ico') def favicon(): return send_from_directory(os.path.join(app.root_path, static_path, 'themes', get_current_theme_name(), 'img'), 'favicon.png', mimetype='image/vnd.microsoft.icon')
I will not repeat that multi-line return statements are difficult to read.
In that case, the code is trivial enough, but nesting calls is a evil habit. Things like:
out = qwe(asd(zxc(iop)), jkl(bnm))
Are not only difficult to read (even with normal variable names) but also more complex to debug, because the intermediate an result does not get their own variable.
That is nice code. I like the following pattern that is used twice in that function:
out = default_value if not nominal: out = something
It is odd to see the HTTP method spelled lower case but that might be something specific to opensearch. Otherwise I try to avoid shortcut variables names in that function ret and response could both be renamed out since it is the output of the function.
Not very interesting, except maybe it would be easier to create a global constant (and in some case, but prolly not here, use an external text file).
Again, factoring the body of the for and using a list comprehension could be nice.
The following code:
foobar = list(something.keys()) foobar.sort()
That is equivalent to:
foobar = sorted(something.keys())
When you use the key keyword argument in list.sort, sorted or others, you might want to rely on the standard library operators.
Nothing interesting to say about this function.
The following code:
headers = dict_subset(request.headers, {'If-Modified-Since', 'If-None-Match'})
Can be rewritten to be more readable with a comprehension:
headers = {key: headers[key] for key in headers if key in ['foo', 'bar']}
I still do not understand why everybody use the variable name logger.
The following:
counter = 0 for item in foobar: counter += 1 do_something(counter, item)
Can be rewritten:
for counter, item in enumrate(foobar): do_something(counter, item)
enumerate is builtin function, hence always available.
I will not repeat what I already wrote about comprehensions. There is good pattern in there. But the last line of the function kills everything:
return render('preferences.html', selected_categories=get_selected_categories(request.preferences, request.form), all_categories=_get_ordered_categories(), locales=settings['locales'], current_locale=request.preferences.get_value("locale"), image_proxy=image_proxy, engines_by_category=engines_by_category, stats=stats, answerers=[{'info': a.self_info(), 'keywords': a.keywords} for a in answerers], disabled_engines=disabled_engines, autocomplete_backends=autocomplete_backends, shortcuts={y: x for x, y in engine_shortcuts.items()}, themes=themes, plugins=plugins, doi_resolvers=settings['doi_resolvers'], current_doi_resolver=get_doi_resolver(request.args, request.preferences.get_value('doi_resolver')), allowed_plugins=allowed_plugins, theme=get_current_theme_name(), preferences_url_params=request.preferences.get_as_url_params(), base_url=get_base_url(), locked_preferences=locked_preferences, preferences=True)
The above is already too tall for function body!
An idea here about how to avoid to use a class and sadly hide the interesting logic away from the where the action happens:
RawTextQuery return an object with a public interface that has 4 methods include some that have side-effects like the following call to changeQuery:
for result in raw_results: raw_text_query.changeQuery(result) # add parsed result results.append(raw_text_query.getFullQuery())
That is necessarily a side-effect because it does not assign a variable otherwise it would be dead-code. Here raw_text_query is mutated, mutable objects are difficult to debug.
Following the spirit of the code, the function searx_bang could be a method of RawTextQuery especially since it is used only once in the whole code base.
Anyway the full function could rewritten as follow:
def autocompleter(): """Return autocompleter results""" query = request.form.get('q', '') try: query, parts, languages, specific = query_parse(query) except BadQuery: return '', 400 # parse searx specific autocompleter results like !bang hits = searx_bang(query, parts, languages, specific) backend_name = request.preferences.get_value('autocomplete') try: completer = autocomplete_backends[backend_name] except KeyError: pass else: if not hits and (len(parts) > 1 or (len(languages) == 0 and not specific)): # get language from cookie language = request.preferences.get_value('language') language = language.split('-')[0] if (language or language == 'all') else 'en' # run autocompletion more = completer(query, language) hits += more # parse results (write :language and !engine back to result string) hits = [do_something(hit) for hit in hits] response = hits_to_response(request, hits) return reponse
Mind the fact:
I dropped the disabled_engine variable, because I am not sure where it is useful.
It is not clear what happens in the last for loop especially with the mutated raw query, so I replaced the whole thing with a comprehension and factored the body in a function called do_something.
But that is not everything we can do. To make the project usable as a library, it will be nice to extract the logic and keep environment specifics in the flask view.
If you do it yourself, you might end up with something like that as the view function autocompleter:
@app.route('/autocompleter', methods=['GET', 'POST']) def autocompleter(): """Return autocompleter results""" query = request.form.get('q', '') if query.isspace(): return 'thanks, but no thanks!', 400 backend_name = request.preferences.get_value('autocomplete') language = request.preferences.get_value('language') language = language.split('-')[0] if (language or language == 'all') else 'en' try: hits = autocomplete(query, backend_name, language) except BadQuery: return 'invalid query because...', 400 response = hits_to_response(request, hits) return reponse
It is a static page so not much to say, except the indentation is not good.
That is the gist of the project. Here is the big problem: validation, logic and rendering is mixed into a giant view function, factorization was done, but there is room for more especially regarding the output generation.
Here is the code that execute the meta search:
# search search_query = None raw_text_query = None result_container = None try: search_query, raw_text_query, _, _ = get_search_query_from_webapp(request.preferences, request.form) # search = Search(search_query) # without plugins search = SearchWithPlugins(search_query, request.user_plugins, request) result_container = search.search() except SearxParameterException as e: logger.exception('search error: SearxParameterException') return index_error(output_format, e.message), 400 except Exception as e: logger.exception('search error') return index_error(output_format, gettext('search error')), 500
There is no point into defining as None the first three variables since it is useless without an actual result, the code that follow expect something that is not None.
Something that is always odd: multiples statements inside the try block. At the very least, it should be something along the line of:
try: try: search_query, raw_text_query, _, _ = get_search_query_from_webapp(request.preferences, request.form) except SearxParameterException as e: return index_error(...) except Exception as e: return index_error(...)
Note that the output, this time is inside a single function called index_error.
Another possible approach would be flat try / except like:
try: query = do_something() except Exception: return error(...) try: hits = do_continue() except Exception: return error(...) return hits_to_response(hits)
The body of the original try / except block does not look good:
search_query, raw_text_query, _, _ = get_search_query_from_webapp(request.preferences, request.form) search = SearchWithPlugins(search_query, request.user_plugins, request) result_container = search.search()
SearchWithPlugins is used only once in the whole code base and the only public method is search. That begs to become a function, it will make clear how do a search!
How meta search works wit searx?
After jumping around definitions I end in the class Search at the method search_multiple_requests:
def search_multiple_requests(self, requests): search_id = uuid4().__str__() for engine_name, query, request_params in requests: th = threading.Thread( target=processors[engine_name].search, args=(query, request_params, self.result_container, self.start_time, self.actual_timeout), name=search_id, ) th._timeout = False th._engine_name = engine_name th.start() for th in threading.enumerate(): if th.name == search_id: remaining_time = max(0.0, self.actual_timeout - (time() - self.start_time)) th.join(remaining_time) if th.is_alive(): th._timeout = True self.result_container.add_unresponsive_engine(th._engine_name, 'timeout') logger.warning('engine timeout: {0}'.format(th._engine_name))
It does trigger simultaneously, using threads, a search query against a search engine based on user preference and the last for block will retrieve the result under a timeout. That is if a search engine does not reply under less that some configured time, it is considered an error.
The part that interest me is:
After some jumping around, I find the mighty directory searx/engines that contains all the logic to query and scrape results 😇.
This is a very quick rework of searx, one might call it a tribute to searx since it is very far from as feature parity.