Nicolas Blin

About

My name is Nicolas Blin, and I am a DevTech Engineer at NVIDIA.

My research interests include algorithm parallelization on GPUs and metaheuristic algorithms.

Read my resume

Work experience

I had the chance to do both of my engineering internships in GPU computing.

The first one was at the 15-20 hospital in Paris:
I worked with CNRS researcher Michael Atlan on medical imaging software. Holovibes visualizes, in real time, the blood circulating in the veins of the retina in order to detect certain eye diseases.
My main task during this internship was to optimize the GPU computation pipeline in C++ & CUDA so it could handle a larger stream of images and thus detect diseases more reliably.
The optimizations I made gave the software a 20x speed-up, raising the average processing rate from 500 fps to 10,000 fps and finally making real-time visualization of the blood circulation possible.

The second one was at NVIDIA, in the cuOpt team in Paris.
I was supervised by Alex Fender and worked on building a completely new PDP (Pickup and Delivery Problem) solver on GPU in C++ & CUDA. I had to write and optimize the solver and parallelize a new metaheuristic algorithm on the GPU. I had the great honor of working with DevTech engineer Akif Çördük and VRP world-record holder Piotr Sielski.
Together, we built the fastest PDP solver on the market.

Now, I am beginning my career at NVIDIA as a DevTech engineer.
I will keep working on cuOpt as well as on various other projects.

Scientific research

I had the opportunity to be one of the 9 students selected by the EPITA Research & Development Laboratory (LRDE). In parallel with my image processing courses, I followed the RDI curriculum dedicated to research, which involves working daily with a researcher and publishing articles.

My field of study was GPU parallelization of the Max-tree algorithm using C++ and CUDA.
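
For context, the max-tree is a hierarchical representation of an image built from the connected components of its upper level sets. The GPU algorithm from this work is not reproduced here; as a purely illustrative reference, the sketch below is a minimal sequential construction in the classic union-find style, assuming an 8-bit 2D input and 4-connectivity (the function name and layout are mine, not the paper's):

```cpp
// Illustrative sequential max-tree construction (union-find style).
// This is NOT the GPU algorithm from the paper, only a reference sketch.
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// find_root with path halving on the auxiliary union-find forest.
static int find_root(std::vector<int>& zpar, int x)
{
    while (zpar[x] != x) {
        zpar[x] = zpar[zpar[x]];
        x = zpar[x];
    }
    return x;
}

// Returns a parent image encoding the max-tree; after canonicalization,
// each tree node is represented by a canonical pixel.
std::vector<int> max_tree(const std::vector<uint8_t>& f, int width, int height)
{
    const int n = width * height;
    std::vector<int> order(n), parent(n), zpar(n);
    std::iota(order.begin(), order.end(), 0);

    // Process pixels from brightest to darkest (max-tree).
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return f[a] > f[b]; });

    std::vector<bool> done(n, false);
    for (int p : order) {
        parent[p] = p;
        zpar[p] = p;
        const int px = p % width, py = p / width;
        const int dx[4] = {-1, 1, 0, 0}, dy[4] = {0, 0, -1, 1};
        for (int k = 0; k < 4; ++k) {
            const int nx = px + dx[k], ny = py + dy[k];
            if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
            const int q = ny * width + nx;
            if (!done[q]) continue;          // only already-processed neighbors
            const int r = find_root(zpar, q);
            if (r != p) {
                parent[r] = p;               // attach the brighter component under p
                zpar[r] = p;
            }
        }
        done[p] = true;
    }

    // Canonicalization: walk from root to leaves so that pixels of equal
    // level in the same component point to one canonical element.
    for (auto it = order.rbegin(); it != order.rend(); ++it) {
        const int p = *it;
        const int q = parent[p];
        if (f[parent[q]] == f[q])
            parent[p] = parent[q];
    }
    return parent;
}
```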

I reached a 10x speed-up over the best multi-threaded algorithm, achieving the desired real-time goal.

The algorithm details and results were gathered in a scientific paper written with my supervisor Edwin Carlinet and published in IEEE Transactions on Parallel and Distributed Systems (TPDS), an A*-ranked scientific journal.

Read the paper

Teaching

I created an advanced GPU programming course for final-year students (Master 2) of the IMAGE major at EPITA.
Several topics are covered during the course:

  • Understanding GPU hardware in greater depth
  • Knowing how to use Nsight Compute / Nsight Systems
  • Mastering the Reduce pattern and all related advanced optimizations
  • Knowing how to program and use the Decoupled Look-back technique as part of the Scan pattern
  • Being able to code and optimize Histogram and Radix Sort algorithms
  • Being able to handle the key tools for cleaner CUDA programming: CUB, Thrust, cuda-gdb, Compute Sanitizer...
The course is divided into six 3-hour classes: four lectures and two hands-on sessions. Assessment is project-based: students have to build an optimized pipeline that repairs images in real time. Repairing the images involves Reduce, Scan, Radix Sort, Histogram... The goal is to implement those algorithms as efficiently as possible and to analyze their performance with Nsight Compute and Nsight Systems.
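
To give a flavor of the tooling side, here is a minimal sketch (not actual course material) of the two-phase pattern used by CUB device-wide primitives such as DeviceReduce, assuming the CUB headers that ship with the CUDA Toolkit; the sizes and names are illustrative only:

```cpp
// Minimal CUB device-wide reduction: query temporary storage, then run.
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main()
{
    const int num_items = 1 << 20;
    std::vector<int> h_in(num_items, 1);          // host input: all ones

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, num_items * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), num_items * sizeof(int), cudaMemcpyHostToDevice);

    // Phase 1: ask CUB how much temporary storage the reduction needs.
    void  *d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);

    // Phase 2: allocate the scratch space and run the actual reduction.
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d (expected %d)\n", h_out, num_items);

    cudaFree(d_temp);
    cudaFree(d_out);
    cudaFree(d_in);
    return 0;
}
```

CUB's DeviceScan follows the same two-call pattern, and its single-pass implementation is based on the Decoupled Look-back technique listed above.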

Projects

Cuda-Ray-Tracer
Ray Tracer written in C++/CUDA enabling real-time photorealistic rendering of multiple shapes with lights, shadows, and mirrors.
prophecy
A deep neural network framework similar to Keras, written from scratch in C++/CUDA.
42sh
A fully functional shell in C, handling variables, paths, globbing, jobs... Interpretation is done with an LL parser and an AST.
chess
Chess engine and AI written in C++ to compete with other students.
hsi-classification
Machine learning pipeline to classify hyperspectral images of fields.
keras-ships-classifications
Deep learning methods to classify images of ships.
Malloc
A reimplementation of the C memory allocation functions (malloc, realloc, free).
pylene-python-bindings
Python bindings for Pylene, a C++ image processing library.
cpp-concurrent-dictionary
Concurrent dictionary with one implementation based on concurrent hash maps and one based on concurrent tries.