Tastypie & Chempi



One of the immediate consequences of refactoring our web services using Django, Tastypie and related approaches (as described here) is that we can run them on almost any database backend. Django abstracts communication with the database, and by using custom QueryManagers we were able to implement chemistry-specific operations, such as substructure and similarity search, in a database-agnostic manner.

This means that, if we want, we can use only Open Source components (such as Postgres and RDKit), or elect to use optimised commercially sourced software as appropriate. However, what if we go one step further and try to use Open Hardware as well? This is exactly what we've just done! We managed to install the full ChEMBL 17 on a Raspberry Pi.
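To make the database-agnostic idea more concrete, here is a minimal sketch of how a manager-like class could pick a backend-specific query. It is not the actual ChEMBL code: the class name, table names and column names are assumptions, and the "oracle" entry calls a made-up function purely for illustration. Only the `@>` substructure operator of the RDKit PostgreSQL cartridge is real. In a real Django manager the vendor string could be read from `django.db.connection.vendor`.

```python
# SQL templates keyed by database vendor.
# "postgresql" uses the RDKit cartridge substructure operator (@>).
# "oracle" uses hypothetical_sss(), an invented placeholder for whatever
# function a commercial cartridge would provide.
SUBSTRUCTURE_SQL = {
    "postgresql": "SELECT molregno FROM mols WHERE m @> %s",
    "oracle": "SELECT molregno FROM mols WHERE hypothetical_sss(m, %s) = 1",
}


class CompoundManager:
    """Stand-in for a custom Django manager (hypothetical name)."""

    def __init__(self, vendor):
        if vendor not in SUBSTRUCTURE_SQL:
            raise ValueError(f"no substructure search defined for {vendor!r}")
        self.vendor = vendor

    def substructure_query(self, smiles):
        """Return (sql, params) for a substructure search on this backend."""
        return SUBSTRUCTURE_SQL[self.vendor], [smiles]


if __name__ == "__main__":
    sql, params = CompoundManager("postgresql").substructure_query("c1ccccc1")
    print(sql, params)
```

The calling code only ever sees `substructure_query`, so swapping the database means adding one dictionary entry rather than touching the web service layer.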

Some frequently asked questions (at least those that have been asked internally) and technical details are below:

1. How much space does it take?

12 GB, including the OS, data and all relevant software. Unfortunately we used a 32 GB SD card, so this is the size you will need if you would like to use our cloned disk image.

EDIT: The compressed image takes 4.13 GB.

2. What OS is it running?

Raspbian, a free operating system based on Debian.

3. Is it slow?

We haven't run any benchmarks yet. Obviously it's slower than our online web services, but then it's a lot cheaper. On the other hand, having performed some sample requests, we can say that performance is certainly acceptable, and there is a lot of room for improvement: Raspberry Pis can easily be overclocked from 700 MHz to 1 GHz, and according to some benchmarks this can double application speed in some cases. The SD card we used is not the fastest one either. Finally, all caching is disabled because we wanted to save disk space, but using database caching from the Django caching framework should give further major improvements, so maybe use the 32 GB image after all.
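For reference, switching on Django's database cache is a small settings change. The fragment below is a minimal sketch, assuming a cache table called `chembl_ws_cache` and a one-day timeout (both are illustrative choices, not values taken from the chempi image):

```python
# settings.py fragment: store cached responses in a database table.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.db.DatabaseCache",
        "LOCATION": "chembl_ws_cache",  # assumed table name
        "TIMEOUT": 60 * 60 * 24,        # assumed expiry: one day, in seconds
    }
}
```

The cache table is created once with `python manage.py createcachetable`. Because cached entries live in the database, they take up space on the SD card, which is the trade-off mentioned above.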

Types of request that chempi can be slower on are:

- Image generation, but if we replace the image with JSON from which the image can be generated using HTML5 canvas on the client side (the way we generated images in our game), it can be much faster. More about this topic in a future blog post.
- Queries using aggregate functions such as COUNT (it seems that we need to optimise our Postgres database by adding some more indexes).
- Substructure and similarity search: again, caching, overclocking and some database and cartridge optimisation (such as choosing faster fingerprints) should solve all the problems. "Premature optimization is the root of all evil", so we first wanted a proof of concept that just works, not necessarily one that works super fast.
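To show why fingerprint choice matters for similarity search, here is a minimal pure-Python sketch of a Tanimoto comparison over bit-vector fingerprints. It assumes Tanimoto is the similarity metric and represents fingerprints as plain integers. The real work on chempi is done inside the database cartridge, not in Python, and the toy fingerprints below are made up.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints stored as integer bit vectors."""
    common = bin(fp_a & fp_b).count("1")  # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")   # bits set in either fingerprint
    return common / total if total else 0.0


def similarity_search(query_fp, library, threshold=0.7):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = ((name, tanimoto(query_fp, fp)) for name, fp in library.items())
    hits = [hit for hit in scored if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Toy 4-bit fingerprints, invented for illustration only.
    library = {"a": 0b1111, "b": 0b0111, "c": 0b0001}
    print(similarity_search(0b1111, library))
```

Every comparison touches every bit of both fingerprints, so shorter or sparser fingerprints reduce the cost of each comparison, at the price of some discriminating power.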

4. Can I make my own chempi?

Yes, we are planning to share our SD card image. We will probably use the BitTorrent protocol to do this because of the image size and some issues we have faced with distributing myChEMBL. We do remember that not everyone has mega-fast broadband!

5. Is chempi useful at all?

Although we think it is interesting as a proof of concept to have a chemical database on such small, open source hardware, we also think it may have some interesting real-world applications in the future:

- Plugging our chempi into a local network makes it immediately accessible to other computers, so this is a zero-configuration demonstration of ChEMBL.
- Analogously to the thesis included in this paper, it can encourage cheminformatics education on low-cost ARM hardware.
- A Raspberry Pi can easily be enhanced with a camera to perform image recognition. This, combined with software like OSRA, could make it possible to scan compound images and search for them in the database.
- Adding an e-ink display (for example, a jailbroken Kindle?) could produce a very interesting small machine...
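As an illustration of the first point, another computer on the same network could query the chempi with a few lines of Python. This is only a sketch under assumptions: the hostname `chempi.local` and the `/chemblws/compounds/<id>.json` path are placeholders, and the real address and URL layout depend on how the image is configured.

```python
import json
from urllib.request import urlopen


def compound_url(host, chembl_id, fmt="json"):
    """Build the URL for one compound record (path layout is an assumption)."""
    return f"http://{host}/chemblws/compounds/{chembl_id}.{fmt}"


def fetch_compound(host, chembl_id):
    """Fetch and decode one compound record from a chempi on the network."""
    with urlopen(compound_url(host, chembl_id), timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Needs a reachable chempi; "chempi.local" is a hypothetical hostname.
    print(fetch_compound("chempi.local", "CHEMBL25"))
```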

6. What are some of the technical details?

To deploy our web services (which are just another Django application) we've used Gunicorn as a server, which in turn connects to NGINX via a standard Unix socket. To make it work as a daemon and launch on machine startup, we've used Supervisor. We believe this is an ideal way to deploy Django not only on a Raspberry Pi but on all production machines, so if you would like to run the ChEMBL web services locally in your company or academic institution, we suggest doing it this way.
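Gunicorn reads its configuration from a Python file, so the Gunicorn part of this setup can be sketched as below. The socket path, worker count and timeout are assumptions chosen for a small single-core board, not the values used on chempi.

```python
# gunicorn.conf.py: minimal sketch of a Gunicorn configuration.
bind = "unix:/tmp/chembl_ws.sock"  # assumed socket path that NGINX proxies to
workers = 2                        # assumed; keep low on a Raspberry Pi
timeout = 120                      # seconds; slow searches need some headroom
```

Supervisor would then be configured to start Gunicorn with `-c gunicorn.conf.py` followed by the project's WSGI module, and NGINX would proxy incoming requests to the same socket path.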


michal
