This paper presents a study on the performance of several machine learning inference runtimes: TensorFlow Serving, PyTorch Serve, and NVIDIA Triton with either of the two as its backend.
The paper was written by Egi Brako and is based on his bachelor’s thesis, which he wrote under my guidance and supervision.
As a lecturer, it is amazing to see one of your students go from writing their thesis to publishing a scientific paper about it. While this is not too uncommon for master’s theses, it is a rare and impressive feat to achieve based on a bachelor’s thesis.
The paper itself compares the aforementioned inference runtimes across different types of machine learning models, covering image classification, speech recognition, and text summarization.
The comparison is based on throughput for both single-stream and multi-stream use cases, as well as on how the runtimes handle large numbers of simultaneous user requests.
Additionally, the study considered the usability of each runtime, judged by how easy it was to operate during the experiments and how compatible it was with the different use cases.
The study found that PyTorch Serve, and NVIDIA Triton with PyTorch as its backend, showed the strongest performance across all use cases.
However, PyTorch Serve has since been archived and is no longer maintained; it is therefore not recommended, as security issues will no longer be fixed. This leaves NVIDIA Triton with the PyTorch backend as the recommended option.
Nevertheless, it should be noted that while TensorFlow Serving showed the weakest performance, it is still actively being developed and improved, so the performance measured in the paper might not be representative of the newest TensorFlow Serving release.
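For readers curious what the recommended setup looks like in practice, here is a minimal sketch of sending an inference request to a Triton server using its official Python HTTP client. The server address, model name ("resnet50"), and tensor names are illustrative assumptions on my part, not details taken from the paper; the "input__0"/"output__0" naming follows Triton's convention for TorchScript models served via the PyTorch backend.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server assumed to run locally on the default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical TorchScript image classifier deployed under the name "resnet50".
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Describe the input tensor (name, shape, datatype) and attach the data.
infer_input = httpclient.InferInput("input__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

# Run inference and read back the requested output tensor.
result = client.infer(
    model_name="resnet50",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("output__0")],
)
print(result.as_numpy("output__0").shape)
```

The same request can also be sent over gRPC via tritonclient.grpc with an essentially identical API, which is what you would typically use for the high-concurrency scenarios the paper measures.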
For all the details, I recommend reading the paper itself, which can be found here: https://www.thinkmind.org/library/SCALABILITY/SCALABILITY_2024/scalability_2024_1_20_20010.html