The shaders to unpack YUV information from the same texture were rather
complicated. Breaking them up into separate textures makes the shaders
much simpler, and we can remove the PRECISION_OFFSET hack.
Performance also gets a nice boost on Intel for planar textures.
Intel GPA, SetStablePowerState, Intel HD Graphics 530, 1920x1080
UYVY: 473 us -> 457 us
YUY2: 492 us -> 422 us
YVYU: 491 us -> 441 us
I420: 1637 us -> 505 us
I422: 1644 us -> 482 us
I444: 1653 us -> 504 us
NV12: 1656 us -> 369 us
Y800 (limited): 270 us -> 277 us
Y800 (full): 263 us -> 289 us
RGB (limited): 341 us -> 411 us
BGR3 (limited): 512 us -> 509 us
BGR3 (full): 527 us -> 534 us
The video format is not updated if switching between cache-compatible
formats, e.g. YUY2 and YVYU, resulting in the wrong conversion technique
being used. This change ensures the format is always up-to-date.
This change only wraps the functionality. I have rough code to exercise
the the query functionality, but that part is not really clean enough to
submit.
The shaders to pack YUV information into the same texture were rather
complicated and suffering precision issues. Breaking them up into
separate textures makes the shaders much simpler and avoids having to
compute large integer offsets. Unfortunately, the code to handle
multiple textures is not as pleasant, but at least the NV12 rendering
path is no longer separate.
In addition, write chroma samples to "standard" offsets. For I444,
there's no difference, but I420/NV12 formats now have chroma shifted to
the left as 4:2:0 is shown in the H.264 specification.
Intel GPA, SetStablePowerState, Intel HD Graphics 530
Expect speed incrase:
I420: 844 us -> 493 us (254 us + 190 us + 274 us)
I444: 837 us -> 747 us (258 us + 276 us + 272 us)
NV12: 450 us -> 368 us (319 us + 168 us)
Expect no change:
NV12 (HW): 580 (481 us + 166 us) us -> 588 us (468 us + 247 us)
RGB: 359 us -> 387 us
Fixes https://obsproject.com/mantis/view.php?id=624
Fixes https://obsproject.com/mantis/view.php?id=1512
Use bilinear filtering to reduce 36 taps to 25 for the regular path.
This works because the middle weights are always between 0 and 1,
allowing texture coordinates to be placed strategically to sample
correct ratios. I'm not sure about the undistort path, so I've left that
alone.
Also remove scaling added in #526, after which weight normalization is
unnecessary. If we want to use or invent an algorithm with alternate
downscaling properties, that's fine, but I don't think we should change
Lanczos scaling to mean something it's not. The scale implementation was
also seen not working when applied directly to scene items because of
assumptions made about the projection matrix.
Intel GPA, SetStablePowerState, Intel HD Graphics 530, D3D11
644x478 -> 1323x1080: 3890 us -> 3401 us
1920x1080 -> 1280x720: 2555 us -> 2261 us
(This also modifies the UI module)
Adds the ability for a source to monitor by default. This is mainly
aimed at browser sources, so that they do not stop outputting audio by
default like they used to.
When mixing sampling with raw loads in a shader, ending a shader with a
load would case the default sampler to become unset for OpenGL. Instead,
initialize with no sampler, and only set if there is a sampler.
RGB to YUV converison was previously baked into every scale shader, but
this work has been moved to the YUV packing shaders. The scale shaders
now write RGBA instead. In the case where base and output resolutions
are identical, the render texture is forwarded directly to the YUV pack
step, skipping an entire fullscreen pass.
Intel GPA, SetStablePowerState, Intel HD Graphics 530, NV12
1920x1080, Before:
RGBA -> UYVX: ~321 us
UYVX -> Y: ~480 us
UYVX -> UV: ~127 us
1920x1080, After:
[forward render texture]
RGBA -> Y: ~487 us
RGBA -> UV: ~131 us
1920x1080 -> 1280x720, Before:
RGBA -> UYVX: ~268 us
UYVX -> Y: ~209 us
UYVX -> UV: ~57 us
1920x1080 -> 1280x720, After:
RGBA -> RGBA (rescale): ~268 us
RGBA -> Y: ~210 us
RGBA -> UV: ~58 us
There are devices like the GV-USB2 that produce frames with smmoth
timestamps at an uneven pace, which causes OBS to stutter because the
unbuffered path is designed to aggressively operate on the latest frame.
We can make the unbuffered path work by making two adjustments:
- Don't discard the current frame until it has elapsed.
- Don't skip frames in the queue until they have elapsed.
The buffered path still has problems with deinterlacing GV-USB2 output,
but the unbuffered path is better anyway.
Testing:
GV-USB2, Unbuffered: Stuttering is gone!
GV-USB2, Buffered: No regression (still broken).
SC-512N1-L/DVI, Unbuffered: No regression (still works).
SC-512N1-L/DVI, Buffered: No regression (still works).
It's a waste of GPU time to do two fullscreen passes to render final mix
previews. Use blend states to simulate the black background of
DrawBackdrop() for the following situations:
- Main preview window (Studio Mode off)
- Studio Mode: Program
This does not effect:
- Studio Mode: Preview (still uses DrawBackdrop)
- Fullscreen Projector (uses GPU clear to black)
- Windowed Projector (uses GPU clear to black)
intel GPA, SetStablePowerState, Intel HD Graphics 530, 1920x1080
Before:
DrawBackdrop: ~529 us
main texture: ~367 us (Cheaper than drawing a black quad?)
After:
[DrawBackdrop optimized away]
main texture: ~383 us
Add a separate shader for area upscaling to take advantage of bilinear
filtering. Iterating over texels is unnecessary in the upscale case
because a target pixel can only overlap 1 or 2 texels in X and Y
directions. When only overlapping one texel, adjust UVs to sample texel
center to avoid filtering.
Also add "base_dimension" uniform to avoid unnecessary division.
Intel HD Graphics 530, 644x478 -> 1323x1080: ~836 us -> ~232 us
Unlike get_properties, there is not reason to not call get_defaults if it is
given in addition to get_defaults2. Additonally this fixes the bug with
'init_encoder' which would only ever call get_defaults, resulting in broken
encoders if those used get_defaults2.
This implements pausing of outputs. To accomplish this, raw audio/video
data is halted to the encoders or raw output. Pausing is as precisely
timed as possible according to the timing of the obs_output_pause call,
and audio data will be spliced down to the exact audio sample in
accordance to that timing at the start/end marks.
Outputs that support this (outputs used for recording) can set the
OBS_OUTPUT_CAN_PAUSE capability flag.
If the audio subsystem was buffered to any extent, the audio of a raw
output would start off at a negative offset, requiring each raw output
to implement a "prepare_audio" function (as seen in the FFmpeg output)
in order to ensure proper synchronization with video. This did not
apply to encoded outputs because it was already being performed by the
obs-encoder code.
If an audio source does not provide enough data at a steady pace, the
timestamp update does not happen, and buffering increases until it
maxes out. To counteract this, update the timestamp anyway.
Another issue for decoupled audio sources is that timing is not
adjusted for divergence from system time. Making this adjustment is
better for timing stability.
5+ hours of stable audio without any buffering on my GV-USB2 where it
used to add 21ms every 5 mintues or so.
Fixes https://obsproject.com/mantis/view.php?id=1269
Fix ternary test to use BGRX render targets for YUV to RGB
conversions. The previous behavior may have been fine though since
the shaders fill the alpha channel with 1.0 anyway.
Code submissions have continually suffered from formatting
inconsistencies that constantly have to be addressed. Using
clang-format simplifies this by making code formatting more consistent,
and allows automation of the code formatting so that maintainers can
focus more on the code itself instead of code formatting.
In 'libobs/CMakeLists.txt', use '${CMAKE_INSTALL_LIBDIR}' instead of
'${CMAKE_INSTALL_PREFIX}/lib', as the latter results into 'libobs.pc'
being installed under '/lib' when '/lib64' would be more appropriate.
In 'libobs/libobs.pc.in', use '@CMAKE_INSTALL_FULL_LIBDIR@' for
'libdir', '@CMAKE_INSTALL_FULL_INCLUDEDIR@' for 'includedir',
and '@CMAKE_INSTALL_PREFIX@' for 'prefix'.
Gentoo-Bug: https://bugs.gentoo.org/644538
The cache coherency of rasterization for full-screen passes is better
using an oversized triangle that is clipped rather than two triangles.
Traversal order of rasterization is GPU-specific, but will almost
certainly be better using an undivided primitive.
A smaller benefit is that quads along the diagonal are not evaluated
multiple times, but that's minor in comparison.
Redo format shaders to bypass vertex buffer, and input layout. Add
global shader bool "obs_glsl_compile" to make API-specific decisions,
i.e. handle upside-down UVs. gl_ortho is not needed for format
conversion because the vertex shader does not use ViewProj anymore.
This can be applied to more situations, but start small first.
Testbed full screen passes, Intel HD Graphics 530:
RGBA -> UYVX: 467 -> 439 us, ~6% savings
UYVX -> uv: 295 -> 239 us, ~19% savings
Similar to item_visible, this event fires whenever a scene item is
locked or unlocked. This allows the UI and libobs to remain in sync
regarding scene elements' statuses.