Extremely high CPU usage with GPU offload

Hi,

Tests of CasparCG Server (v2.4.0-beta1) are being run on a two-socket server with 20 cores (40 threads) and two Quadro M4000 GPUs with 8 GB RAM each.

  1. Once the server is started with the default configuration, CPU utilization is close to 0%.
  2. Add one UDP multicast stream consumer with arguments -format mpegts -codec:v h264_nvenc -codec:a aac (an example command is sketched below): all 40 CPUs sit at around 35-40% usage, and one of the GPUs handles the stream encoding at around 10% utilization.
  3. PLAY an mp4 clip: all 40 CPUs go above 85% usage, while the GPU continues to encode at around 10% utilization.
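
For reference, a minimal sketch of how such a stream consumer can be added over AMCP (the multicast address and port here are placeholders, not the ones actually used):

    ADD 1 STREAM udp://239.0.0.1:5000 -format mpegts -codec:v h264_nvenc -codec:a aac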

Is this expected behavior?

Thank you!

All I can say on this is that I agree that those numbers seem high.

But also I am not sure how well Caspar runs on dual-socket or dual-GPU machines. It has not been written with machines like this in mind, so it would not surprise me if it performs terribly due to memory constantly being in the ‘wrong’ place (i.e. local to the other CPU, causing accesses to be really inefficient). I have no data to prove or disprove anything though.
For compositing, we always use just the ‘first’ GPU. I don’t know how ffmpeg will choose which GPU to use for encoding/decoding.

What CPUs are you using?

I have Caspars (currently still 2.0.7, but that changes soon) running on dual-CPU / dual-GPU machines. I found that it works best when running multiple instances of the Caspar server, using different ports to control them independently. Somehow Windows distributes the available CPUs and GPUs across the different server instances, so it runs quite well that way.
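
As an illustration of the different-port setup (a sketch based on the standard casparcg.config layout; the port number is just an example), each instance gets its own config file with a distinct AMCP port:

    <controllers>
        <tcp>
            <port>5251</port>
            <protocol>AMCP</protocol>
        </tcp>
    </controllers>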

Dear Caspar Team,

This is the CPU info.

I am running CasparCG Server on Linux from Docker, starting it with two consumers and no files playing. I have a requirement to provide 8 Mb/s of bandwidth.

The initial CPU usage was reduced from 35-40% to 25% by installing the latest oneTBB version.

Once a clip is played on the first channel, the CPUs go up to 85% and the bandwidth drops to 5 Mb/s. Once a clip is added for the second channel, the CPUs start to oscillate around 80% and the bandwidth drops to 3 Mb/s.

Specifying a cpuset in Docker restricted to one NUMA node only does not improve the situation:
--cpuset-cpus=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38
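
For reference, a minimal sketch of how such a pinned container might be started (the image name, network settings and GPU flag are assumptions, not taken from this thread):

    docker run --gpus all \
        --cpuset-cpus=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38 \
        --cpuset-mems=0 \
        --network host \
        casparcg-server:2.4.0-beta1

Pinning --cpuset-mems to the same NUMA node as the CPUs keeps allocations local, which is the point of restricting the container to one socket in the first place.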

Here are the CPU usage dynamics during the test:

Here are the bandwidth dynamics during the test:

Any help is appreciated!

Interesting thread, just my two cents for the discussion. For a test I ran an Ubuntu Server instance in a Proxmox VM and noticed the performance was lacking. I did not test it further and instead installed Ubuntu Server directly on the machine instead of Proxmox. Then it performed a lot better.

I would investigate whether it is possible to test without Docker and see if it affects performance. I would assume yes, but by how much? Could it be something with the GPU passthrough to the Docker container?

Best Regards
Simon

For running on Ubuntu without Docker, I have been exploring providing a PPA for installing CasparCG: the CasparCG : “CasparCG” team PPA on Launchpad currently has a usable build of 2.4.0-beta1 available.
I’m not 100% finished with that PPA (I might restructure/rename/rearrange packages), but it does work currently.
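
As a rough sketch of how installing from a Launchpad PPA would look (the PPA name here is an assumption; the package name is guessed from the casparcg-server-2.4 binary mentioned later in the thread):

    sudo add-apt-repository ppa:casparcg/casparcg
    sudo apt update
    sudo apt install casparcg-server-2.4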


Thanks for sharing this info! Once installed, how can it be started/stopped? Does it support GPU encoding?

I am also interested in how the package is used. I have only been able to start the server on an Ubuntu server via Blackbox with rxvt-unicode, since Caspar must have X11 information.
Though I have a custom manager around the server that starts both the server and the scanner and provides me with a simple API to restart the server remotely, change the config, etc. So maybe I cannot apply this package as easily.
A headless Ubuntu Caspar server is, in my mind and deployment, the most convenient way to run it.

This package doesn’t provide any help with starting/stopping so far; it just provides the executable and ensures all needed dependencies are installed.

It uses whatever ffmpeg is installed on the system. It looks like ffmpeg -encoders includes nvenc, so I expect that it will work.
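
To verify that on a given box (a standard ffmpeg check, nothing CasparCG-specific):

    ffmpeg -hide_banner -encoders | grep nvenc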

I expect that you can use this PPA and change your manager to invoke the executable it installs instead.
That is a good point; the scanner should also be available as a package here.

But yeah, because of the requirement for X11 (when I last checked, this was very hard to remove as CEF also required it) I am not sure how best to launch it. So for now these packages are leaving that up to the user.

It would be interesting to see what people are doing to solve launching.
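
For what it’s worth, here is a sketch of two launch patterns seen on headless boxes (both are assumptions about typical setups, not something confirmed in this thread): point CasparCG at an X server already running on the GPU, or fall back to a virtual display at the cost of GPU compositing.

    # against a real X server on the GPU (keeps GPU-accelerated compositing)
    DISPLAY=:0 ./casparcg

    # software-only fallback via a virtual framebuffer
    xvfb-run -a ./casparcg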


I have tested the provided PPA version of the CasparCG server, and the behavior is exactly the same as the one listed above.

On the other hand, the same test was done on a single-socket AMD CPU with two GPUs; this issue is not observed there and the behavior of the server is as expected.

My assumption is that the tbb::parallel_for and tbb::parallel_invoke usages in the server cause this CPU overhead. Moreover, there was such a CasparCG issue in the server, and one piece of feedback was that removing the parallel operations solves the problem. On the other hand, there is still an open oneTBB issue about high CPU usage.

Do you think using an alternative to oneTBB is a good idea?

Do you have in mind some code optimization in the parallel part that could help?

What are the CPU and GPU types? Do you run it on Windows or Linux?

Just for the test, I substituted all the parallel code with sequential code. The CPU usage is the same (above 90%) on all CPUs.

I have also profiled the provided PPA casparcg-server-2.4 binary; it accounts for around 10% of the usage, so the other 80% are busy cycles.

I have tried different kernel scheduling profiles; the CPU usage is the same (above 90%) on all CPUs.
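
For anyone wanting to repeat this kind of profiling, a typical approach with standard Linux tools (not necessarily the exact commands used here; the process name may differ for the packaged binary):

    sudo perf top -p "$(pidof casparcg)"
    sudo perf record -g -p "$(pidof casparcg)" -- sleep 30 && sudo perf report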

Any ideas?

I have done a test with the PPA distribution you provide, just playing a single clip without any streams, and here is the CPU usage.

Maybe some of the libraries you use are not suitable for dual CPUs; could you guess which ones?

Hi @Julusian ,

I fixed the issue above by forcing the X server to use the NVIDIA X driver.

Just execute nvidia-xconfig. As a result the X server is not accessible, but the CPU usage is as expected.

How did you manage to use nvidia-xconfig? What did you do differently? Is it running inside Docker? What do you mean the X server is not accessible? Not able to RDP into it?

Very interesting that you managed to solve it. Providing a little more detail would probably help the next person with a similar issue immensely.

Hi @ZimeX ,

The issue with the high CPU usage came from the fact that mixing, which uses the X11 server, was done on the CPUs rather than on the GPU(s), because of a missing xorg.conf file. nvidia-xconfig generates an xorg.conf (of course you can also write it yourself); the most important part is the Device section, where you say which card to use for the mixing.

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:5@0:0:0"
EndSection

Once you have it, just restart the Linux display manager and everything is fixed.
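
For completeness, a sketch of the whole sequence on a systemd-based distribution (the PCI bus ID in the snippet above is specific to that machine; look up your own):

    lspci | grep -i nvidia        # find the bus ID of the GPU that should do the compositing
    sudo nvidia-xconfig           # writes /etc/X11/xorg.conf using the nvidia driver
    sudo systemctl restart display-manager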


Thank you for the response.

Are you still running CasparCG through a Docker container? Do you have any DeckLink output cards attached to the container, or what are you outputting?

/Simon

Hi @ZimeX

The output is to UDP. It runs both from a Docker container and standalone.

/Pavel

I did test using DeckLinks in Docker once; this needs passing through certain devices:
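
The exact device list from that test isn’t shown here, but as an illustration (the device node names are assumptions; check what the Desktop Video driver actually creates under /dev/blackmagic on the host):

    docker run --device /dev/blackmagic/dv0 --device /dev/blackmagic/io0 ... casparcg-server:2.4.0-beta1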
