I have written several blog posts related to watermarking.

The example Python code in those posts performs RGB operations on all or some of the pixels of an image. For a 2048 x 1152 image, at least Width x Height x Channels = 7,077,888 calculations are required per frame. For video at 30 frames per second, that becomes 212,336,640 operations per second.
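The arithmetic behind these figures is short enough to check directly. The sketch below assumes 3 channels (RGB) and one operation per channel value; the variable names are only for illustration.

```python
# Operation count for one frame and for one second of video.
width, height, channels = 2048, 1152, 3   # assumption: 3 channels (RGB)
fps = 30

ops_per_frame = width * height * channels
ops_per_second = ops_per_frame * fps

print(f"{ops_per_frame:,} operations per frame")
print(f"{ops_per_second:,} operations per second")
```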

Python is easy to learn and comes with a large number of modules, so you can expect high productivity. However, performance drops sharply as the number of iterations of loop statements such as for and while grows. The reason is that Python is interpreted: the interpreter works through the loop body again, instruction by instruction, on every iteration, and checks the types of the operands each time instead of running pre-compiled machine code.
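To get a feel for the work the interpreter repeats, the standard dis module can list the bytecode instructions CPython runs for a loop. This is only a small sketch: loop_func here is a stand-in for the example further down, the function is disassembled rather than executed, and the exact instruction names vary between Python versions.

```python
import dis

def loop_func():
    total = 0
    for x in range(10000000):
        total += x
    return total

# Print the bytecode CPython generated for loop_func.
dis.dis(loop_func)

# Collect the instruction names; FOR_ITER fetches the next value of x
# at the start of every iteration, and the body instructions follow it.
opnames = [ins.opname for ins in dis.get_instructions(loop_func)]
print(len(opnames), "bytecode instructions in total")
```

Every instruction between FOR_ITER and the jump back to it is dispatched by the interpreter once per iteration, which is where the time goes in a loop with ten million iterations.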

Let's compare the performance by writing the following simple example in C and Python.

This simple example adds up the integers from 0 to 9,999,999 (the loop bound 10000000 itself is excluded).

#include <stdio.h> // for printf()
#include <sys/time.h> // for gettimeofday()

// Sum the integers 0 .. 9999999 and print the result.
void loop_func(){
  unsigned long sum = 0;
  for(int x = 0; x < 10000000; x++){
    sum += x;
  }
  printf("SUM[0 ~ 10000000] is %lu\n", sum); // %lu matches unsigned long
  return;
}

int main() {
    struct timeval start, end;
    long secs_used,micros_used;

    gettimeofday(&start, NULL);
    loop_func();
    gettimeofday(&end, NULL);

    secs_used=(end.tv_sec - start.tv_sec); //avoid overflow by subtracting first
    micros_used= ((secs_used*1000000) + end.tv_usec) - (start.tv_usec);
    float secs_elapsed = micros_used * 1.0 / 1000000;
    printf("micros_used: %ld\n",micros_used);
    printf("sec used: %10.6f\n",secs_elapsed);
    return 0;
}

<for.c>

Now compile and run the code.

root@ubuntusrv:/usr/local/src/study/tmp# gcc for.c 
root@ubuntusrv:/usr/local/src/study/tmp# ./a.out
SUM[0 ~ 10000000] is 49999995000000
micros_used: 19571
sec used:   0.019571

The compiled C program takes 0.019571 seconds on my computer to add up the integers from 0 to 9,999,999.
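The printed sum can be cross-checked without running any loop, using the closed form for the sum of the first n non-negative integers, n(n-1)/2. The snippet below is just that check; n is the loop bound from for.c.

```python
# Sum of 0 + 1 + ... + (n - 1) using the closed form n * (n - 1) / 2.
n = 10000000
closed_form = n * (n - 1) // 2
print("SUM[0 ~ %d] is %d" % (n, closed_form))

# The value is far below 2**63, so it fits in a 64-bit integer.
fits_in_64_bits = closed_form < 2**63
print("fits in 64 bits:", fits_in_64_bits)
```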

This time, I'll test it with Python code that does the same thing.

import time

def loop_func():
    sum = 0
    for x in range(10000000):
        sum += x
    print("SUM[0 ~ 10000000] is %ld"%sum);    
    
    
start = time.time()
loop_func()
end = time.time()
print("sec used: %10.6f"%(end - start))

<for.py>

Now run the code.

root@ubuntusrv:/usr/local/src/study/tmp# python3 for.py 
SUM[0 ~ 10000000] is 49999995000000
sec used:   0.950677

This time it took 0.950677 seconds. A program written in C is about 49 times faster.

This figure obviously depends on what the program does. As in the example above, code dominated by simple loops shows a large difference in speed. However, using a numerical module such as numpy narrows the gap, because numpy is written in C and the work inside a numpy function runs at compiled speed.
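As a rough illustration of that point, the same sum can be handed to numpy so that the loop runs inside compiled code. This is a sketch, not a measurement from the original test machine; the int64 dtype is chosen explicitly so the result cannot overflow, and the array needs about 80 MB of memory.

```python
import time
import numpy as np

n = 10000000

start = time.perf_counter()
values = np.arange(n, dtype=np.int64)   # 0, 1, ..., n - 1
total = int(values.sum())               # the summing loop runs in C inside numpy
elapsed = time.perf_counter() - start

print("SUM[0 ~ %d] is %d" % (n, total))
print("sec used: %10.6f" % elapsed)
```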

You can also run partially compiled code from Python. For code written in pure Python, or code that uses certain modules such as numpy, the numba module can compile selected functions to machine code with its JIT (Just in Time) compiler, which is built on llvmlite, and then execute the compiled version.

Install numba and compare performance

First, install numba with the pip3 command.

root@ubuntusrv:/usr/local/src/study/tmp# pip3 install numba
Collecting numba
  Downloading numba-0.53.0-cp38-cp38-manylinux2014_x86_64.whl (3.4 MB)
     |████████████████████████████████| 3.4 MB 2.0 MB/s 
Collecting llvmlite<0.37,>=0.36.0rc1
  Downloading llvmlite-0.36.0-cp38-cp38-manylinux2010_x86_64.whl (25.3 MB)
     |████████████████████████████████| 25.3 MB 383 kB/s 
Collecting numpy>=1.15
  Downloading numpy-1.20.1-cp38-cp38-manylinux2010_x86_64.whl (15.4 MB)
     |████████████████████████████████| 15.4 MB 7.3 MB/s 
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from numba) (45.2.0)
Installing collected packages: llvmlite, numpy, numba
Successfully installed llvmlite-0.36.0 numba-0.53.0 numpy-1.20.1

Now, let's compile the simple iteration function loop_func of the for.py code above using numba and run it.

There are many features in numba, but first, let's use only the easiest one.

Import jit from the numba module and add the decorator to the function you want to JIT-compile. The syntax of this decorator is described in detail in the Numba documentation.

import time
from numba import jit

@jit(nopython=True, cache=True)
def loop_func():
    sum = 0
    for x in range(10000000):
        sum += x
    print("SUM[0 ~ 10000000] is " + str(sum));    
    
    
start = time.time()
loop_func()
end = time.time()
print("sec used: %10.6f"%(end - start))

<for_numba.py>

When using numba, the %-formatted print call raises an error:

print("SUM[0 ~ 10000000] is %ld"%sum);

so the call was changed to string concatenation:

print("SUM[0 ~ 10000000] is "+ str(sum));

If you run for_numba.py, the first run is no faster than pure Python, but from the second run on it is about 20 times faster (0.95 seconds versus 0.047 seconds). On the first run the JIT compiler compiles the function to machine code, and because the decorator was given cache=True, that machine code is saved to an on-disk cache and reused by later runs.

If cache is set to False, the function is compiled again in every run of the program, so the speed-up only pays off when the JIT-compiled function is called repeatedly within a single run.

root@ubuntusrv:/usr/local/src/study/tmp# python3 for_numba.py 
SUM[0 ~ 10000000] is 49999995000000
sec used:   1.026282
root@ubuntusrv:/usr/local/src/study/tmp# python3 for_numba.py 
SUM[0 ~ 10000000] is 49999995000000
sec used:   0.046993
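The compile-on-first-call behaviour can also be observed inside a single process. The sketch below is a variant of for_numba.py under a few assumptions: it uses njit (numba's shorthand for jit(nopython=True)), leaves out cache=True, and returns the sum so that the %-style formatting can stay in ordinary Python. The ImportError fallback is not part of the original script; it is only there so the sketch still runs, uncompiled, where numba is not installed.

```python
import time

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    # Assumption: without numba the sketch still runs, just uncompiled.
    def njit(func):
        return func

@njit
def loop_sum(n):
    total = 0
    for x in range(n):
        total += x
    return total

n = 10000000
timings = []
for call in (1, 2):
    start = time.perf_counter()
    result = loop_sum(n)
    timings.append(time.perf_counter() - start)
    # Formatting happens outside the jitted function, so % works here.
    print("call %d: SUM[0 ~ %d] is %d, sec used: %10.6f"
          % (call, n, result, timings[-1]))
```

With numba installed, the first call includes the compilation time and the second call should be much shorter; the exact figures depend on the machine.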

Now the numba version is only about 2.4 times slower than the C program.
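The ratios between the three timings quoted so far can be worked out with a few lines of arithmetic. The numbers are copied from the runs above; only the variable names are new.

```python
# Timings (in seconds) taken from the runs shown above.
c_sec = 0.019571              # for.c
python_sec = 0.950677         # for.py
numba_cached_sec = 0.046993   # for_numba.py, second run

python_vs_c = python_sec / c_sec
python_vs_numba = python_sec / numba_cached_sec
numba_vs_c = numba_cached_sec / c_sec

print("pure Python vs C    : %.1f times slower" % python_vs_c)
print("pure Python vs numba: %.1f times slower" % python_vs_numba)
print("numba vs C          : %.1f times slower" % numba_vs_c)
```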

Improving pixel operation speed using Numba

Let's compare the performance using the example from Image Processing #5 - WaterMark using alpha channel.

I added a print statement to the example code, plus a part that calls the watermarking function 10 times and prints the total time used.

import argparse
import cv2
import numpy as np
import time
'''
Overlay the mask image (with alpha channel) on the base image at pos = (x, y).
The alpha value of each mask pixel is used as the blending weight.
'''
def process_alpha_masking(base, mask, pos):
    h, w, c = mask.shape
    hb, wb, _ = base.shape
    x = pos[0]
    y = pos[1]

    #check mask position
    if(x > wb or y > hb):
        print(' invalid overlay position (' + str(x) + ',' + str(y) + ')')
        return None
    
    #the mask must have an alpha channel (4 channels)
    if c != 4:
        print('mask image file does not have alpha channel')
        return None
    
    #adjust mask
    if(x + w > wb):
        mask = mask[:, 0:wb - x]
        print(' mask X size adjust W:' + str(w) + ' -> W:' + str(wb - x))
    if(y + h > hb):
        mask = mask[0:hb - y, :]
        print(' mask Y size adjust H:' + str(h) + ' -> H:' + str(hb - y))

    h, w, c = mask.shape
    
    img = base.copy()
    bg = img[y:y+h, x:x+w]      #overlay area
    for i in range(0, h):
        for j in range(0, w):
            B = mask[i][j][0]
            G = mask[i][j][1]
            R = mask[i][j][2]
            alpha = mask[i][j][3] * 1.0 / 255.0
            if (alpha > 0.0):
                bg[i][j][0] = int(B * alpha + bg[i][j][0] * (1 - alpha))
                bg[i][j][1] = int(G * alpha + bg[i][j][1] * (1 - alpha))
                bg[i][j][2] = int(R * alpha + bg[i][j][2] * (1 - alpha))
    img[y:y+h, x:x+w] = bg
    return img


parser = argparse.ArgumentParser(description="OpenCV Example")
parser.add_argument("--file", type=str, required=True, help="filename of the input image to process")
parser.add_argument("--mask", type=str, required=True, help="mask image to overlay")
args = parser.parse_args()

img = cv2.imread(args.file, cv2.IMREAD_COLOR)
height, width, channels = img.shape
print("image   H:%d W:%d, Channel:%d"%(height, width, channels))
cv2.imwrite('/tmp/original.jpg', img)

mark = cv2.imread(args.mask, cv2.IMREAD_UNCHANGED)
mheight, mwidth, mchannels = mark.shape
print("mask   H:%d W:%d, Channel:%d"%(mheight, mwidth, mchannels))

x = np.amin( [mwidth, width])
y = np.amin( [mheight, height])
start = time.time()
#run the watermarking 10 times; the loop variable x replaces the x computed above
for x in range(10):
    new_img = process_alpha_masking(img, mark, (x, y))
end = time.time()
print("sec used: %10.6f"%(end - start))
if new_img is not None :
    cv2.imwrite('/tmp/masked.jpg', new_img)

<watermark_alpha.py>

And the following is the code using numba. Except for the numba import and the decorator, nothing has changed.

import argparse
import cv2
import numpy as np
import time
from numba import jit
'''
Overlay the mask image (with alpha channel) on the base image at pos = (x, y).
The alpha value of each mask pixel is used as the blending weight.
'''
@jit(nopython=True, cache=True)
def process_alpha_masking(base, mask, pos):
    h, w, c = mask.shape
    hb, wb, _ = base.shape
    x = pos[0]
    y = pos[1]

    #check mask position
    if(x > wb or y > hb):
        print(' invalid overlay position (' + str(x) + ',' + str(y) + ')')
        return None
    
    #the mask must have an alpha channel (4 channels)
    if c != 4:
        print('mask image file does not have alpha channel')
        return None
    
    #adjust mask
    if(x + w > wb):
        mask = mask[:, 0:wb - x]
        print(' mask X size adjust W:' + str(w) + ' -> W:' + str(wb - x))
    if(y + h > hb):
        mask = mask[0:hb - y, :]
        print(' mask Y size adjust H:' + str(h) + ' -> H:' + str(hb - y))

    h, w, c = mask.shape
    
    img = base.copy()
    bg = img[y:y+h, x:x+w]      #overlay area

    for i in range(0, h):
        for j in range(0, w):
            B = mask[i][j][0]
            G = mask[i][j][1]
            R = mask[i][j][2]
            alpha = mask[i][j][3] * 1.0 / 255.0
            if (alpha > 0.0):
                bg[i][j][0] = int(B * alpha + bg[i][j][0] * (1 - alpha))
                bg[i][j][1] = int(G * alpha + bg[i][j][1] * (1 - alpha))
                bg[i][j][2] = int(R * alpha + bg[i][j][2] * (1 - alpha))
    img[y:y+h, x:x+w] = bg
    return img


parser = argparse.ArgumentParser(description="OpenCV Example")
parser.add_argument("--file", type=str, required=True, help="filename of the input image to process")
parser.add_argument("--mask", type=str, required=True, help="mask image to overlay")
args = parser.parse_args()

img = cv2.imread(args.file, cv2.IMREAD_COLOR)
height, width, channels = img.shape
print("image   H:%d W:%d, Channel:%d"%(height, width, channels))
cv2.imwrite('/tmp/original.jpg', img)

mark = cv2.imread(args.mask, cv2.IMREAD_UNCHANGED)
mheight, mwidth, mchannels = mark.shape
print("mask   H:%d W:%d, Channel:%d"%(mheight, mwidth, mchannels))

x = np.amin( [mwidth, width])
y = np.amin( [mheight, height])
start = time.time()
#run the watermarking 10 times; the loop variable x replaces the x computed above
for x in range(10):
    new_img = process_alpha_masking(img, mark, (x, y))
end = time.time()
print("sec used: %10.6f"%(end - start))
if new_img is not None :
    cv2.imwrite('/tmp/masked.jpg', new_img)

<watermark_alpha_numba.py>

Now let's compare the performance of the two codes.

root@ubuntusrv:/usr/local/src/study/tmp# python3 watermark_alpha_numba.py --file=biden.jpg --mask=watermark.png
image   H:727 W:320, Channel:3
mask   H:32 W:100, Channel:4
sec used:   1.818784
root@ubuntusrv:/usr/local/src/study/tmp# python3 watermark_alpha_numba.py --file=biden.jpg --mask=watermark.png
image   H:727 W:320, Channel:3
mask   H:32 W:100, Channel:4
sec used:   0.044184
root@ubuntusrv:/usr/local/src/study/tmp# python3 watermark_alpha.py --file=biden.jpg --mask=watermark.png
image   H:727 W:320, Channel:3
mask   H:32 W:100, Channel:4
sec used:   0.333172

With numba, the first run took a long time because the JIT compiler has to run, but from the second run on it is quite fast. Compared with the version that does not use numba, it is about 7.5 times faster (0.333 seconds versus 0.044 seconds for the 10 calls).
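For comparison, the earlier remark that numpy narrows the gap applies to this pixel loop as well. The sketch below is a different technique from the numba approach used in this post: it expresses the same blending formula with numpy array operations instead of nested loops. The function name blend_alpha_vectorized and the tiny test arrays are hypothetical, it has not been timed against the figures above, and it assumes the mask has already been clipped to the same height and width as the background area.

```python
import numpy as np

def blend_alpha_vectorized(bg, mask):
    """Blend a BGRA mask onto a BGR background of the same height and width."""
    # Alpha as float in [0, 1]; the trailing axis lets it broadcast over B, G, R.
    alpha = mask[:, :, 3:4].astype(np.float64) / 255.0
    blended = mask[:, :, :3] * alpha + bg * (1.0 - alpha)
    # astype drops the fraction, like int() in the loop version.
    return blended.astype(np.uint8)

# Tiny synthetic demo (hypothetical data, not the images used in this post).
bg = np.full((2, 2, 3), 100, dtype=np.uint8)
mask = np.zeros((2, 2, 4), dtype=np.uint8)
mask[0, 0] = (255, 255, 255, 255)   # opaque white pixel
mask[0, 1] = (0, 0, 0, 128)         # half-transparent black pixel
out = blend_alpha_vectorized(bg, mask)
print(out[:, :, 0])
```

Pixels whose alpha is 0 come out unchanged, so the explicit alpha > 0.0 test of the loop version is not needed here. In the watermark script the result would be written back into the overlay area, for example img[y:y+h, x:x+w] = blend_alpha_vectorized(bg, mask).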

Both programs produce the same watermarked image.

Wrapping Up

If you apply JIT compilation using numba to functions that use a lot of loop statements in Python code, you can see a significant performance improvement.

However, numba is not applicable in all cases. Its nopython mode supports only a subset of Python and numpy, so functions that rely on unsupported features or modules cannot be compiled as they are.

You can refer to the Numba documentation to learn a wide range of numba features and uses.
