Fu|Re RAM/VRAM Management and calling for more bug reports

Moderator: Chad

UserNoah
Fusioneer
Posts: 194
Joined: Mon Mar 09, 2020 11:43 am
Been thanked: 6 times

RAM/VRAM Management and calling for more bug reports

#1

Post by UserNoah »

This is a somewhat unusual bug report: I hope to get more people to actually report bugs and issues to Blackmagic Design, and I need your help with this particular case. There is a comp file in the next post that should show you the issue when rendering.

Fusion version:
16 to 16.2

OS and version:

Windows 10 latest updates

Additional relevant system info:
RTX 2060 Super


Description of the bug:

Fusion fills RAM and VRAM with previous frames until both are full, and then only seems to clear enough VRAM for one more frame. This results in short system freezes and vastly higher render times. It can be worked around by calling fu:PurgeCache() in the frame render script, but that also purges still frames and non-animated masks from memory.

I am writing this because I have messaged Blackmagic Design support with the details; this has bugged me ever since Fusion 16. I chose F16 over F9 for this project because I feel the Delta Keyer, 3D and Planar Tracker performance is much better in F16, and I needed all of them a lot here. I also know about the PurgeCache script, which stabilizes the render times.

But support answered that this is the first time they have heard about this and that it could be connected to my system. They asked me for more system logs and want me to do a clean reinstall of drivers, disable antivirus software, etc. I can do all of that, but I am confident the issue isn't connected to any of it.
While I am currently working as a freelancer on my own rig (newly assembled from the ground up in April), this issue was always present in F16, even when I was working at a small studio where I used Fusion on 4 different computers, all with different configurations and all having this issue.

I have sent a video to support that I am sharing here as well. It's ugly, but it does the job:

[Embedded video]

And this is what I wrote to support:
I captured the video by filming my display with a camera, since OBS and other screen capture software take up a big share of performance and VRAM.
I set the Fusion UI to 200%, so hopefully you are able to see it properly, even though the YouTube compression is pretty strong.

The comp I am rendering is a mid-sized green screen project with a few Delta Keyers, some optical flow, vector motion blur (which still produces artifacts in GPU mode, as I described in a bug report a few months back) and Renderer3Ds that only render three image planes.
I can't share this comp file due to NDAs, but the problem isn't specific to this composition; it's simply the last shot I worked on.

I go through different render configurations: Render Manager, rendering in the composition, rendering with the PurgeCache script enabled, and rendering only one frame at a time.
In conclusion, Fusion struggles to clear previous frames out of RAM/VRAM and spends more time clearing just enough for one more frame than actually rendering.

Of course, using the PurgeCache script isn't difficult and is the fastest way of rendering right now. But if Fusion were a little smarter about handling RAM/VRAM, it could be even faster, because I wouldn't need to force-purge everything in RAM/VRAM, including the tools that don't need to be processed again each frame (still frames, non-animated masks, backgrounds, texts, etc.).
I'm not a developer and could be wrong, but even when I am only rendering a single frame at a time, the VRAM and RAM still fill up completely and the spikes in render time still happen. This has to mean that Fusion keeps unnecessary data around, slowing everything down.
Even if this didn't increase render times, it would still be unnecessary memory consumption that could be spent elsewhere in Fusion, like rendering more frames at a time, since it's only using around 40% of my system anyway.
This issue is not present in any other software I have used on my own rig or on those of my previous employers. Not even GPU renderers like Redshift, Octane and ProRender, GPU tools like Vellum and volume sims in Houdini, After Effects or Resolve's Edit page have caused any issues like this.

So if you have noticed this as well, or can test it, please write to Blackmagic Design support. Not just about this issue, but about all of them. They can't fix issues if they don't hear from the people having them. Don't rely on them reading the posts in the forums.

It would be interesting to see if this is Windows only or if Mac and Linux are affected as well.

I am unsure whether I should post this on the official forums as well, in the hope that more people read it and acknowledge the problem.

Severity (Trivial, Minor, Major, Critical)
Major; Minor if you know about the PurgeCache script, but generally annoying.


Steps to reproduce:
Create a composition with GPU tools like vector motion blur, optical flow or Renderer3D. It needs to be long enough to fill your VRAM.

Edit: You can use the test comp I posted below.
Last edited by UserNoah on Sun Jun 07, 2020 3:55 am, edited 2 times in total.

Re: RAM/VRAM Management and calling for more bug reports

#2

Post by UserNoah »

I realize this might be a lot to ask for, so I created a test comp with a lot of GPU tools, specifically designed to fill your VRAM.
vram_test.comp
It's a simple 3D scene with some post-FX and minimal branching. The Saver has 2 versions: version 1 is the default, and version 2 adds the PurgeCache render script. The Saver writes the images to a folder below the comp, so probably into your Downloads folder if you don't move the comp somewhere else.

On my system it took around 18 frames until the render times jumped from 6 seconds a frame to 2 minutes a frame. They didn't recover from there, and I didn't finish the render because it would have taken ages. But my system didn't freeze while Fusion was purging the VRAM on its own, so there must be another variable that's responsible for the freezes.

Using the PurgeCache script increased the render times to around 15 seconds a frame, BUT they stayed constant, which means a shorter render time overall.
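
To put rough numbers on that trade-off, here is a minimal shell sketch using the figures above (about 18 fast frames at 6 seconds each, then 2 minutes per frame, versus a constant 15 seconds per frame with PurgeCache). The function names and the 100-frame comp length are made up for illustration, and real timings will differ per comp and system.

```shell
#!/bin/bash
# Estimated total seconds for a render of $1 frames WITHOUT purging:
# the first 18 frames take 6 s each, every later frame takes 120 s.
seconds_without_purge() (
    frames=$1; fast_frames=18; fast=6; slow=120
    if [ "$frames" -le "$fast_frames" ]; then
        echo $(( frames * fast ))
    else
        echo $(( fast_frames * fast + (frames - fast_frames) * slow ))
    fi
)

# Estimated total seconds WITH the PurgeCache script: a constant 15 s per frame.
seconds_with_purge() (
    echo $(( $1 * 15 ))
)

# Example with a hypothetical 100-frame comp:
#   seconds_without_purge 100
#   seconds_with_purge 100
```

With these figures the unpurged render only comes out ahead for comps of up to 19 frames. For the hypothetical 100-frame comp it would need 9948 seconds (about 2 hours 46 minutes) without purging, against 1500 seconds (25 minutes) with it.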

If possible, please test this on your machine. You can open the Task Manager and keep an eye on the VRAM. You don't have to render the whole comp; it would be enough to know whether the same spikes happen on your system. Information on which OS and GPU you're using would be great.
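
For anyone who would rather keep a log than watch a gauge, VRAM usage on NVIDIA cards can also be sampled from a shell. This is only a sketch under assumptions: it expects nvidia-smi to be on the PATH and a POSIX-style shell to be available, it only looks at GPU 0, and the function names vram_percent and log_vram are made up here.

```shell
#!/bin/bash
# Turn one nvidia-smi CSV line such as "7456, 8192" (used MiB, total MiB)
# into a whole-number percentage of VRAM in use.
vram_percent() (
    used=${1%%,*}      # text before the first comma
    total=${1##*, }    # text after the last ", "
    echo $(( used * 100 / total ))
)

# Print a timestamped VRAM reading for GPU 0 every $1 seconds (default 1).
# The loop stops as soon as nvidia-smi fails, e.g. when it is not installed.
log_vram() {
    while line=$(nvidia-smi -i 0 --query-gpu=memory.used,memory.total \
                            --format=csv,noheader,nounits); do
        echo "$(date +%T) VRAM ${line} MiB ($(vram_percent "$line")%)"
        sleep "${1:-1}"
    done
}

# Example: sample every 2 seconds while the render runs, stop with Ctrl+C.
#   log_vram 2 > vram_log.txt
```

Comparing the timestamps in the log with the frame times in Fusion's console should show whether the render time spikes line up with the VRAM topping out.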

I also tested this in Resolve and it happened exactly as I suspected. Rendering this comp in Resolve with a Saver shows the same spikes, but using a MediaOut and rendering through the Deliver page shows no spikes in render times, since Resolve now takes care of the memory. That means: rendering with a Saver gives estimated render times of over 1 hour, while rendering on the Deliver page gives an actual render time of 12 minutes. Isn't Resolve unable to branch and render several frames at once? Because this is even faster than rendering with the PurgeCache script in Fusion.

But this isn't a solution, just a demonstration of the problem.

AndrewHazelden
Fusionator
Posts: 1768
Joined: Fri Apr 03, 2015 3:20 pm
Answers: 11
Location: West Dover, Nova Scotia, Canada
Been thanked: 39 times

Re: RAM/VRAM Management and calling for more bug reports

#3

Post by AndrewHazelden »

Hi @UserNoah.

My 100% non-canonical, 2 (Canadian) cents thrown in on this Bug Tracker topic is as follows:

In Fusion Studio v16.x, this might potentially be a memory leak inside the LuaJIT (JIT) interpreter's memory stack. That could be what causes the memory management failures, which (in theory) might be why your topped-out/red-lined RAM/VRAM is never released on your high-powered GPUs.

LuaJIT and your Comps

The JIT realtime execution features exist (AFAIK) in just *some* of Fusion's nodes, and extras like the garbage collector or FFI support are another pair of sleeves (operationally speaking) compared to troubleshooting pure Fusion timeline RAM caching and OpenGL GPU rendering tasks done in a Renderer3D node.

With JIT bugs you'd have to modify your debugging technique beyond simply watching memory usage levels in Fusion's text-based memory gauges, or via the Task Manager (Windows)/Activity Monitor (macOS)/top (Linux).

I kind of expect that (in theory) sampling of the Fusion process (macOS), or other approaches like using gdb is essential to track down what you are experiencing.

LuaJIT has near-magical on-the-fly code optimizations, and other tricks like FFI support that go far beyond what mortal-made Lua tools are expected to be able to do. It (LuaJIT) is also an opaque black box to interact with unless you are an expert like "@MikePail" or "@PeterLoveday".

Who do I talk to if I want to get my issues Resolved?

BMD would likely have troubleshooting guides to quickly help you resolve issues like the GPU related bugs you are having today.

I think that Dwaine, the BMD DaVinci Specialist in the United States is possibly the correct support avenue for you to track down if you are able to reach out to him.

What other tech broke with the JIT modifications found in v16.2.x?

The recent Fusion Studio (standalone) and Resolve Fusion page v16.2.x releases from BMD added a few unexpected show-stopping bugs that wreaked havoc with a ton of pre-existing Reactor content (assuming the code used a loop that counted to a number higher than ~60 in any Lua script, or possibly in some fuses).

This new JIT feature regression was kindly noted and reported by Movalex and Millolab on the BMD forums back in March, and overall that issue is (still) not fully Resolved™ to a satisfactory level (IMHO).

FWIW, I still get bug reports even as recently as two days ago via email, from faraway countries, letting me know that KartaVR is impacted by this v16.2.x bug, simply because I didn't revise all the Lua scripts *I've ever written* inside Reactor to manually force OFF the speed-tuning improvements in the FuScript API's Lua code execution... with the anti-productivity optimization of "jit.off()" in the interim. Oh bollocks. Not this cycle again :bmd:...!?¿? :bmd:

Past JIT Issues in Fusion v8 - v9

Back in Fusion v8 and v9 the LuaJIT memory stack caused limitations in how many CustomTool nodes, EXRIO fuses, and on-the-fly Lua simple : expression evaluating based tools you were able to add to a single mega-comp.

IIRC @Kristof on WSL did a node copy/paste and then render test in ~2017/2018-ish, and there were some random upper bounds in Fusion (free) v8.2.1 to 9.0.2+ of like ~250 to ~1000 or so nodes before it all blew up and imploded. Around that era, I made a custom UI Manager tool called PasteNode.lua to help stress test the JIT limits in Fusion v9.0.1. Man, those were the days when everything got a UI Manager treatment on WSL. :D

TBH my full recollections today are a bit hazy on the exact details and race-conditions / trigger events that cause JIT issues to jump off a pier in FU with 100% repeatable accuracy.

Someone else on WSL like the core "Cryptomatte for Fusion" dev team of @Kristof, and @Cedric will know far more about this JIT stuff in Fu v9 - 16+ than I do, for sure, since it's likely etched in their minds from mildly-permanent Fuzionmonger related battle scars. :!:

Andrew's suggestions to Noah, assuming your goals are to get your VFX/comp efforts back on track today:

Dump Fusion Studio's Built-in Render Manager GUI and use *any* other tool to run your Fusion Render Nodes via the CLI. Any alternative job controlling tool will do. Pick one and use it for your needs today! Seriously. :roll:


There are several free or indie-artist affordable external render manager programs on the market, and some of them offer deals like ~3 nodes for free if you want to try one out on a small-farm scale for a quick test.

The concept I am suggesting to you @UserNoah today, is that you should try to have that farm controller program be responsible for launching the current comp file on each of the Fusion Render Node applications installed on your Fusion Render Node based "slave" systems.

You would limit the frame range to between ~1 to 15 or so frames, per individual small render job task/chunk.

And you would tell the render controller program to pass each of your render node boxes "one chunk" of frames at a time, to be rendered from your active Fusion comp file.

The external render farm controller program you choose will then terminate the Fusion Render Node app on completion of that multi-frame chunk job. Your memory is then released fully by the OS, and you have relieved the "memory pressure" you are seeing on your RAM/VRAM.

Now you are 99% less likely to have the failed network renders happen... and won't even see a RAM/VRAM issue build up on your GPU powered Fusion render farm nodes (most days). Sure, that's a workaround but it does work.
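
The chunking idea itself fits in a few lines of shell, even without a full render manager. The sketch below only illustrates the control flow: render_in_chunks and render_chunk are hypothetical names, and render_chunk is a placeholder that just prints what it would do. How a Render Node is actually launched for a frame range depends on your setup and is deliberately left out.

```shell
#!/bin/bash
# Hypothetical placeholder: replace the echo with whatever starts a fresh
# render process for frames $2..$3 of comp $1 and exits when the chunk is done.
render_chunk() {
    echo "render $1 $2-$3"
}

# Walk the inclusive frame range $2..$3 of comp $1 in chunks of $4 frames.
# Each chunk runs in its own process, so the OS reclaims all of its RAM/VRAM
# when that process exits. The body runs in a subshell to keep variables local.
render_in_chunks() (
    comp=$1; first=$2; last=$3; size=$4
    while [ "$first" -le "$last" ]; do
        chunk_end=$(( first + size - 1 ))
        if [ "$chunk_end" -gt "$last" ]; then
            chunk_end=$last
        fi
        render_chunk "$comp" "$first" "$chunk_end" || exit 1   # stop on the first failed chunk
        first=$(( chunk_end + 1 ))
    done
)

# Example: frames 1-40 in chunks of 15 gives 1-15, 16-30 and 31-40.
#   render_in_chunks my_shot.comp 1 40 15
```

A dedicated render manager adds what this sketch lacks, such as distributing chunks across machines and retrying failed chunks, which is why it remains the recommended route.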

Also source your media from EXR sequences, and render on the farm to EXR sequences and you'd also have less failure in your Fusion based efforts. :)

And this is a neat bonus tip you could explore, in time: Using 2 GPUs on a Fusion Studio Compositing Workstation


Deadline and Other Options

I've heard of a lot of people using Deadline with Fusion (some issues exist and there are threads on WSL about Deadline).

For my needs, I've been exploring whacky non-Deadline based render managers for macOS and Linux usage of Fusion like:

Phase 1) In the past from 2015 to Nov 2019 I used RenderPal for Fusion. It was great in its day.

Phase 2) Then after Nov 2019 I moved onto Pixar's "Tractor" render manager with a home-spun DIY Fusion Studio + Render Node binding using Lua exported Alfred .alf files. Overall, I think Tractor's pretty nifty stuff, and you get a license of Tractor with every copy of RenderMan for Blender/Maya/Houdini/Katana you have in your life! In 2020, Go Tractor for the win! :)

Phase 3) And now for AWS/Azure cloud usage, I'm exploring (but not using fully yet) the open-source high-throughput/high-performance scientific task distribution tools like "HTCondor" and "SLURM" for my future rendering needs in 2021.

(SLURM isn't just a slug extract based soft-drink from Futurama anymore, it also does wonders for top-500 supercomputing sites, and Fusion users with exacting requirements.)

A CLI based HTCondor job task looks like the set of three files below. I ran them on a 2015 MacBook Pro laptop, on the source EXR multi-channel differential fixed-noise-pattern denoising image examples available in Reactor, using a "personal" HTCondor deployment installed via the brew package manager:
Code: (job_submission.sh)
#!/bin/bash
# HTCondor job submitter

echo
printf "Start time: "; /bin/date

echo
echo "Submitting HTCondor jobs..."

# Stop if the job folder is missing, so condor_submit never runs elsewhere
cd "$HOME/Desktop/altus_condor_v3/" || exit 1

# Make sure the folder used for the log, output and error files exists
mkdir -p logs

# Only report success if the submission actually worked
condor_submit altus_job.submit || exit 1

echo "HTCondor submission complete!"

Code: (altus_job.submit)
# The UNIVERSE defines an execution environment. You will almost always use VANILLA.
Universe = vanilla

# Request one CPU core per job (raise this value to lock more cores on a MBP)
request_cpus = 1

# EXECUTABLE is the program your job will run. It's often useful
# to create a shell script to "wrap" your actual work.
Executable = altus_job.sh

# Passed to altus_job.sh as: frame number, image folder, image sequence base name
Arguments = "$(Process) $ENV(HOME)/Desktop/altus_condor_v3/images satellite_lighting_test_v012_FUL"

# ERROR and OUTPUT are the error and output channels from your job
# that HTCondor returns from the remote host.
Error = logs/altus_job.$(Process).error
Output = logs/altus_job.$(Process).output

# The LOG file is where HTCondor places information about your
# job's status, success, and resource consumption.
Log = logs/altus_job.$(Process).log

# QUEUE is the "start button" - it launches any jobs that have been
# specified thus far. Queue 5 starts five jobs with $(Process) = 0 to 4.
Queue 5

Code: (altus_job.sh)
#!/bin/bash
# HTCondor job description: denoise one frame with Altus
# Arguments: $1 = frame number, $2 = image folder, $3 = image sequence base name

echo
echo "-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-="
echo
printf "Start time: "; /bin/date
echo
printf "Job is running on node: "; /bin/hostname
echo
printf "Job running as user: "; /usr/bin/id
echo
printf "Job is running in directory: "; /bin/pwd
echo
echo
echo "Starting Altus job..."
echo

# Variables
ALTUSOUT="altus_out"
ALTUSPROGRAM="/Applications/Altus/altus-cli"

#F=120
F=$1

# Altus denoiser settings
KC1=0.45
KC2=8
KC4=5
KF=0.6
RADIUS=10
QUALITY="preview"
DEVICEID=2
PLATFORMID=1

# Pad the frame number to four digits
printf -v F04 "%04d" "$F"
echo "${F04}"

# Content to process
B0PATH="${2}/${3}_b0.${F04}.exr"
B1PATH="${2}/${3}_b1.${F04}.exr"
FINALFRAMEPADDED="${2}/${ALTUSOUT}/${3}_nr.${F04}.exr"

#B0PATH="/Altus/satellite_lighting_test_v012_FUL_b0.${F04}.exr"
#B1PATH="/Altus/satellite_lighting_test_v012_FUL_b1.${F04}.exr"
#FINALFRAMEPADDED="${ALTUSOUT}/satellite_lighting_test_v012_FUL_nr.####.exr"

# Create the output folder if it does not exist yet
mkdir -p "${2}/${ALTUSOUT}/"

# The labels are quoted so the shell does not treat [..] as a glob pattern
echo "[Frame] ${F04}"
echo "[B0] $B0PATH"
echo "[B1] $B1PATH"

# Run Altus on the two noise-pattern buffers and their AOV channels
"${ALTUSPROGRAM}" \
--out-path="${FINALFRAMEPADDED}" \
--rgb-0="${B0PATH}" \
--rgb-1="${B1PATH}" \
--pos-0="${B0PATH}::worldPositions" \
--pos-1="${B1PATH}::worldPositions" \
--nrm-0="${B0PATH}::worldNormals" \
--nrm-1="${B1PATH}::worldNormals" \
--alb-0="${B0PATH}::diffuseFilter" \
--alb-1="${B1PATH}::diffuseFilter" \
--vis-0="${B0PATH}::matteShadow" \
--vis-1="${B1PATH}::matteShadow" \
--kc_1="${KC1}" \
--kc_2="${KC2}" \
--kc_4="${KC4}" \
--kf="${KF}" \
--radius="${RADIUS}" \
--quality="${QUALITY}" \
--device-id="${DEVICEID}" \
--platform-id="${PLATFORMID}"

echo
echo "[Altus Output]"
echo "${FINALFRAMEPADDED}"
echo
echo "-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-="
echo
echo "Altus job complete!"

Re: RAM/VRAM Management and calling for more bug reports

#4

Post by UserNoah »

Hello Andrew, thank you so much for your detailed explanation!

I am in contact with support and, as mentioned before, they have never heard about this issue.
I am using Deadline as a render manager, which unfortunately doesn't work with F16. And breaking up the comp would require me to always pre-cache everything like the Trails node or particle sims, even if I only render on a single machine. That is of course possible, but it's only a workaround for an issue that should be fixable, especially because switching to Resolve and rendering on the Deliver page does not have the same problems.

Regarding EXRs: after wondering why Fusion took ages to load and save PNGs, I learned the hard way to only ever feed it EXRs :D

I also think Fusion uses too much RAM in general. There is no difference in RAM usage whether I render 1 or 5 frames at the same time: at the same frame number it will have filled everything that it's allowed to fill. Of course, reducing the number of frames helps if it actually runs out of RAM, but why does it always have to keep so much unnecessary data around in the first place?

I will update this post when I know more from BMD, and in the meantime I will try more of your workarounds, Andrew.
In case I didn't make this clear enough amid all my complaining: I am very, very grateful that you wrote such an amazingly detailed post to help me!

Re: RAM/VRAM Management and calling for more bug reports

#5

Post by UserNoah »

I just heard back from BMD support: they have been able to replicate the behavior and have passed it on to the dev team. Hopefully this can be fixed soon. I will update this post if I hear anything else.

Re: RAM/VRAM Management and calling for more bug reports

#6

Post by AndrewHazelden »

Hi @UserNoah.
UserNoah wrote:... they have been able to replicate the behavior and have passed it to the Dev team
That's awesome news. Congrats on pushing forward with your BMD support ticket efforts to try and get a resolution to the issue. 👍

Re: RAM/VRAM Management and calling for more bug reports

#7

Post by UserNoah »

I have just rendered my test comp in Fusion 17 and it rendered without crippling itself due to the VRAM management. I also tested it again with the newest build of 16 and it started to hang and struggle after frame 18. I hope this means that this memory leak is finally fixed!

sk-films
Posts: 1
Joined: Wed Nov 25, 2020 3:25 pm

Re: RAM/VRAM Management and calling for more bug reports

#8

Post by sk-films »

Unfortunately it's not:

I'm getting the same behavior in Fusion 17.2.1.
I haven't tested your comps yet, but I'm getting this problem in mine.

I tried to flush the memory every 20 frames with this code, but it seems that comp.CurrentTime is simply ignored on a Render Node:

Code:

if comp.CurrentTime/20 == math.floor(comp.CurrentTime/20) then fu:PurgeCache() end
So for now I ended up with a simple C# console app that kills and restarts the Render Node every 30 minutes.
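
The same kill-and-restart idea can be sketched in shell for anyone who wants to try it. This is not the C# app mentioned above, only an illustration under assumptions: it relies on the GNU coreutils `timeout` command (which exits with status 124 when it had to stop the command), and the function name restart_periodically and its run limit are made up here. The command to restart is whatever starts your Render Node.

```shell
#!/bin/bash
# Run a command for at most $1 seconds, then stop it and start it again,
# up to $2 runs in total. The command and its arguments follow as $3 onwards.
# The body runs in a subshell so its variables stay local.
restart_periodically() (
    max_seconds=$1; max_runs=$2; shift 2
    run=1
    while [ "$run" -le "$max_runs" ]; do
        echo "run $run: starting"
        timeout "$max_seconds" "$@"
        status=$?
        if [ "$status" -ne 124 ]; then
            # The command ended by itself (finished or crashed): stop looping.
            echo "run $run: exited on its own (status $status)"
            exit "$status"
        fi
        echo "run $run: killed after ${max_seconds}s, restarting"
        run=$(( run + 1 ))
    done
)

# Example: restart a hypothetical launcher script every 30 minutes, 48 times.
#   restart_periodically 1800 48 ./start_render_node.sh
```

Killing the process throws away whatever frame was in progress, so this only makes sense when the render manager hands unfinished frames out again.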

marcelpinter
Posts: 1
Joined: Wed Jun 30, 2021 10:56 am

Re: RAM/VRAM Management and calling for more bug reports

#9

Post by marcelpinter »

Hi all,

Exactly the same thing is happening to me as well, so the problem might be more common than they think, and I hope they figure out a solution for it soon. Working with a 12K BRAW clip is nearly impossible, even with only a Transform node applied (and while on the topic, they really should include position keyframe smoothing in the Edit page), and even with 128 GB of RAM and a 3090.

Oh and I'm using Resolve 17.2.1

Re: RAM/VRAM Management and calling for more bug reports

#10

Post by UserNoah »

I have set this topic back to unsolved and marked it as affecting both Fusion Studio and the Resolve Fusion page, since several users are reporting issues.
I do believe there are a few separate issues going on at the same time, but it's difficult for the end user to know where an issue actually stems from.

While the test comp I initially created has not caused these insane slowdowns for a few Fusion versions now, there are a couple of other issues with VRAM not being cleared early enough.