VP9 Hardware Acceleration is Real

20/06/2016

Hardware acceleration for video codecs is almost mandatory.

VP9 is getting a performance boost

There are three things that keep VP8 in the game when compared to H.264:

  1. It was the only video codec in Chrome for WebRTC in the last 5 years, giving it a head start in deployments
  2. H.264, while available in mobile chipsets, isn’t always accessible to the developer (and doesn’t always work as it should when it is accessible)
  3. VP8 and H.264 are rather old now, so software implementations of them are quite decent

 

With VP9, the main worry was that it would be left behind and not get the love and attention from chipset vendors – leading it to the same fate as VP8 – abysmal, if any, hardware acceleration support. That is probably why Google went to great lengths to get it running on YouTube so soon and keeps publicizing its stats.

This worry is now rather behind us. Recent signs show some serious adoption from the companies that we should really care about:

#1 – ARM

Mobile=ARM

Without checking stats, I’d say that 99% or more of all smartphones sold in the past 5 years are based on ARM.

If and when ARM decides to support a feature directly, that brings said feature very close to world domination in future smartphones.

Which is somewhat what happened last week – ARM announced its Mali Egil Video Processor with VP9 acceleration.

Here’s a deck they shared:

Being farther away from chipsets than I was 5 years ago, it is hard for me to say if this is an integral part of an ARM processor, but I believe that it isn’t. It is an add-on component that takes care of video processing, which chipset vendors add next to their ARM core. They can source the design from ARM or other suppliers – or they can develop their own.

Not sure how popular the ARM alternative is for video processing, but they have the advantage of being the first alternative for any chipset vendor (hell – they already source the ARM core itself, so why not bundle?). Which also means every other vendor needs to match up to their feature set – and improve on it.

Now that VP9 encode/decode capabilities are front and center in the ARM Mali Egil, it has become a mandatory checkmark for everyone else as well.

#2 – Intel

If ARM is the king of mobile, then Intel rules the desktop.

As with ARM, I haven’t been following up on Intel CPU acceleration lately. And as with ARM, it was Fippo who got my attention with this link here: the new Intel Media SDK.

For those who don’t know, Intel provides several interesting software packages that make direct use of its chipset capabilities, especially when it comes to optimizing different types of workloads. The Intel IPP and Media SDKs handle media related processing, and are quite popular with low level developers who need access to such facilities.

From the release page itself:

With this release we are happy to announce new full hardware accelerated support for HEVC and VP9.

  • HEVC Main 10 (10-bit) encoder and decoder support
  • VP9 8-bit and 10-bit decoder support

So… HEVC (=H.265) has encode and decode while VP9 only has decode support.

Probably because HEVC has been in the works for a lot longer than VP9, but there’s hope still.

#3 – Alliance for Open Media

The Alliance for Open Media. I’ve published a recent update on the alliance.

Intel was there from the start. The recent additions include ARM, AMD and NVIDIA.

I am sure additional chipset vendors will be joining in the coming months – there seems to be a ramp up in memberships there, with Ateme and Adobe added to their logos just last week.

While the alliance is about what comes after VP9, it is easy to see how these vendors may sway to using VP9 in the interim.

The Future

The future is most definitely one of royalty free video codecs. We’ve got there with voice, now that we have Opus (though Speex and SILK were there before to pave the way). We will get there with video as well.

Coding technologies need to be accessible and available to everyone – freely – if we are to achieve Benedict Evans’ latest claims: Video is the new HTML. But for that, I’ll need another post.

So… which of these video codecs should you use in your application? Here’s a free mini video course to help you decide.

Responses

Constantine says:
September 26, 2016

I’m curious, would it be easy to add VP9 support to any of the VAAPI drivers, or does VP9 need special hardware support? (It seems not, but then why is it not there yet?)

    Tsahi Levent-Levi says:
    September 26, 2016

    Constantine,

    As far as my understanding goes, like any other video codec, you really do want hardware acceleration for it – and like any other video codec – such support is special at least to some extent.

    Dennis Mungai says:
    June 3, 2017

    Hello there,

    Adding the reference encoders and decoders in VAAPI isn’t trivial, because Intel’s drivers (i915, and the intel-hybrid-driver used to expose hybrid decode capabilities for SKUs such as Skylake for VP9 and HEVC 10-bit) must support the texture format (VAAPI_VLDs) that the VAAPI stack mandates.

    It’s for this reason that some platforms, such as AMD’s Polaris can do HEVC 10-bit decode in hardware via VAAPI because they can accept that texture format through their mesa-gallium driver implementation (Set LIBVA_DRIVER_NAME to radeonsi).

minus says:
October 4, 2016

VAAPI drivers are by design built to drive specialized hardware that implements decoding (and encoding) of a specific format “in metal”, so you first need such hardware. On such hardware, supporting VP9 in VAAPI is easy, and it really is supported.
Of course, it is also possible to have the codec working without specialized hardware, on the “general purpose” processing unit instead.

It may also be interesting to have the codec working on GPU shaders …

    Tsahi Levent-Levi says:
    October 4, 2016

    Thanks.

    If it is that simple, then why hasn’t it been done so far?

    Dennis Mungai says:
    July 4, 2017

    A small comment about having a hardware-accelerated video encode pipeline running on GPU shaders: Yes, it has been done before, in the age before SIP blocks such as NVIDIA’s NVENC, AMD’s VCE and Intel’s QuickSync.

    However, general purpose shaders are very inefficient at encoding, with drawbacks in programmability, power draw and video quality tuning. In the past, we’ve had ATI’s AVIVO and Nvidia’s NVCUVENC (CUVID)-based encoding (now deprecated in favor of NVENC), and some encoders such as x264 have even implemented an OpenCL-based lookahead system, which has somewhat diminishing returns on higher end hardware.

    For decoding, a hybrid approach has been used extensively, even on current generation hardware such as Intel’s Skylake utilizing a hybrid mode for HEVC 10-bit and VP9 8 and 10-bit codecs. And that comes with the caveats mentioned above: Increased power draw (at the baseline), and nearly unpredictable performance on varying hardware configurations. For instance, an Intel Iris Pro SKU is likely bundled with a faster CPU, resulting in better decode performance. The same cannot be said of another device form factor, such as a tablet, that utilizes a weaker, binned version of the same integrated GPU.

    And on programmability: Look at AMD’s VCE hybrid encode mode, which is rarely, if at all, used. There are tasks that are best left out of shader pipelines, and video decoding is one of them.

    Over time, we expect to see a wide range of hardware-based SIP blocks implementing support for up and coming codecs such as Alliance for Open Media’s AV1, as well as the current VP9, VP8, HEVC and H.264 codecs. In fact, for AOM’s codecs to succeed, they’ll need on-launch hardware-based acceleration to enable mass adoption.

    The thing is, when we scale down silicon to enable the same media playback functionality on mobile platforms as on faster mainstream PCs, the argument for hybrid approaches to video decode and encode fades out quickly. System integrators and SIP designers such as Cadence/Tensilica have stepped in to provide dedicated licensed SIP hardware blocks to AMD for both the VCE and the VCN (coming in Vega GPUs) for this very reason.

    And over time, you can expect to see cross-vendor licensing of SIPs on variant platforms for these applications. Intel, for instance, has used PowerVR SGX 535 graphics cores developed by Imagination Technologies under license for their GMA 500 GPUs, and such collaborations will only continue to tighten.

    Regards,

    Dennis.

HEVC, VP9 and The Future of Video Codecs | Headjack says:
May 18, 2017

[…] HEVC, since there is not really any hardware accelerated encoding available for it yet, even though both Intel and ARM chips have built-in VP9 hardware decoding (which is why VP9 videos play so smoothly on both PCs and […]

Dennis Mungai says:
June 26, 2017

Hello guys,

As of today, FFmpeg supports a VAAPI-based VP9 encoder when it is built with the --enable-vaapi option: https://gist.github.com/Brainiarc7/24de2edef08866c304080504877239a3

However, you’ll need an Intel Kabylake-based Integrated GPU to take advantage of this feature.

And now, with the new vp9_vaapi encoder, here’s what we get.

Encoder options now available:

ffmpeg -h vp9_vaapi

Output:

Encoder vp9_vaapi [VP9 (VAAPI)]:
General capabilities: delay
Threading capabilities: none
Supported pixel formats: vaapi_vld
vp9_vaapi AVOptions:
-loop_filter_level E..V.... Loop filter level (from 0 to 63) (default 16)
-loop_filter_sharpness E..V.... Loop filter sharpness (from 0 to 15) (default 4)
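
The encode command itself is not shown in this comment. As a rough sketch only, the Python helper below assembles one plausible ffmpeg invocation for the vp9_vaapi encoder. The filter chain (format=nv12, hwupload, scale_vaapi to 1920x1080) is taken from the filter names that appear in the log further down; the render node path /dev/dri/renderD128, the file names and the function name build_vp9_vaapi_command are assumptions made for illustration, not the commenter's actual setup.

```python
import shlex


def build_vp9_vaapi_command(src, dst, device="/dev/dri/renderD128",
                            width=1920, height=1080,
                            loop_filter_level=16, loop_filter_sharpness=4):
    """Return an ffmpeg argument list for a VAAPI VP9 encode.

    Hypothetical helper: the device path and defaults are assumptions.
    The two loop filter defaults mirror the values printed by
    `ffmpeg -h vp9_vaapi` above.
    """
    # Convert to NV12 in software, upload frames to the GPU, then scale there.
    filters = f"format=nv12,hwupload,scale_vaapi=w={width}:h={height}"
    return [
        "ffmpeg",
        "-vaapi_device", device,   # open the VAAPI render node
        "-i", src,
        "-vf", filters,
        "-c:v", "vp9_vaapi",       # the hardware VP9 encoder
        "-loop_filter_level", str(loop_filter_level),
        "-loop_filter_sharpness", str(loop_filter_sharpness),
        dst,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line; nothing is executed here.
    print(shlex.join(build_vp9_vaapi_command("input.mkv", "output.webm")))
```

Building the argument list separately from running it makes it easy to inspect the command before handing it to subprocess.run on a machine that actually has the hardware.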

What happens when you try to pull this off on unsupported hardware, say Skylake?

See the sample output below:

[Parsed_format_0 @ 0x42cb500] compat: called with args=[nv12]
[Parsed_format_0 @ 0x42cb500] Setting 'pix_fmts' to value 'nv12'
[Parsed_scale_vaapi_2 @ 0x42cc300] Setting 'w' to value '1920'
[Parsed_scale_vaapi_2 @ 0x42cc300] Setting 'h' to value '1080'
[graph 0 input from stream 0:0 @ 0x42cce00] Setting 'video_size' to value '3840x2026'
[graph 0 input from stream 0:0 @ 0x42cce00] Setting 'pix_fmt' to value '0'
[graph 0 input from stream 0:0 @ 0x42cce00] Setting 'time_base' to value '1/1000'
[graph 0 input from stream 0:0 @ 0x42cce00] Setting 'pixel_aspect' to value '1/1'
[graph 0 input from stream 0:0 @ 0x42cce00] Setting 'sws_param' to value 'flags=2'
[graph 0 input from stream 0:0 @ 0x42cce00] Setting 'frame_rate' to value '24000/1001'
[graph 0 input from stream 0:0 @ 0x42cce00] w:3840 h:2026 pixfmt:yuv420p tb:1/1000 fr:24000/1001 sar:1/1 sws_param:flags=2
[format @ 0x42cba40] compat: called with args=[vaapi_vld]
[format @ 0x42cba40] Setting 'pix_fmts' to value 'vaapi_vld'
[auto_scaler_0 @ 0x42cd580] Setting 'flags' to value 'bicubic'
[auto_scaler_0 @ 0x42cd580] w:iw h:ih flags:'bicubic' interl:0
[Parsed_format_0 @ 0x42cb500] auto-inserting filter 'auto_scaler_0' between the filter 'graph 0 input from stream 0:0' and the filter 'Parsed_format_0'
[AVFilterGraph @ 0x42ca360] query_formats: 6 queried, 4 merged, 1 already done, 0 delayed
[auto_scaler_0 @ 0x42cd580] w:3840 h:2026 fmt:yuv420p sar:1/1 -> w:3840 h:2026 fmt:nv12 sar:1/1 flags:0x4
[hwupload @ 0x42cbcc0] Surface format is nv12.
[AVHWFramesContext @ 0x42ccbc0] Created surface 0x4000000.
[AVHWFramesContext @ 0x42ccbc0] Direct mapping possible.
[AVHWFramesContext @ 0x42c3e40] Created surface 0x4000001.
[AVHWFramesContext @ 0x42c3e40] Direct mapping possible.
[AVHWFramesContext @ 0x42c3e40] Created surface 0x4000002.
[AVHWFramesContext @ 0x42c3e40] Created surface 0x4000003.
[AVHWFramesContext @ 0x42c3e40] Created surface 0x4000004.
[AVHWFramesContext @ 0x42c3e40] Created surface 0x4000005.
[AVHWFramesContext @ 0x42c3e40] Created surface 0x4000006.
[AVHWFramesContext @ 0x42c3e40] Created surface 0x4000007.
[AVHWFramesContext @ 0x42c3e40] Created surface 0x4000008.
[AVHWFramesContext @ 0x42c3e40] Created surface 0x4000009.
[AVHWFramesContext @ 0x42c3e40] Created surface 0x400000a.
[vp9_vaapi @ 0x409da40] Encoding entrypoint not found (19 / 6).
Error initializing output stream 0:0 -- Error while opening encoder for output stream #0:0 - maybe incorrect parameters such as bit_rate, rate, width or height
[AVIOContext @ 0x40fdac0] Statistics: 0 seeks, 0 writeouts
[aac @ 0x40fcb00] Qavg: -nan
[AVIOContext @ 0x409f820] Statistics: 32768 bytes read, 0 seeks
Conversion failed!

The interesting bit is the “Encoding entrypoint not found” error: the VP9 encoding entrypoint is absent on this particular platform, as confirmed by vainfo’s output:

libva info: VA-API version 0.40.0
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/local/lib/dri/i965_drv_video.so
libva info: Found init function __vaDriverInit_0_40
libva info: va_openDriver() returns 0
vainfo: VA-API version: 0.40 (libva 1.7.3)
vainfo: Driver version: Intel i965 driver for Intel(R) Skylake - 1.8.4.pre1 (glk-alpha-71-gc3110dc)
vainfo: Supported profile and entrypoints
VAProfileMPEG2Simple : VAEntrypointVLD
VAProfileMPEG2Simple : VAEntrypointEncSlice
VAProfileMPEG2Main : VAEntrypointVLD
VAProfileMPEG2Main : VAEntrypointEncSlice
VAProfileH264ConstrainedBaseline: VAEntrypointVLD
VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
VAProfileH264Main : VAEntrypointVLD
VAProfileH264Main : VAEntrypointEncSlice
VAProfileH264Main : VAEntrypointEncSliceLP
VAProfileH264High : VAEntrypointVLD
VAProfileH264High : VAEntrypointEncSlice
VAProfileH264High : VAEntrypointEncSliceLP
VAProfileH264MultiviewHigh : VAEntrypointVLD
VAProfileH264MultiviewHigh : VAEntrypointEncSlice
VAProfileH264StereoHigh : VAEntrypointVLD
VAProfileH264StereoHigh : VAEntrypointEncSlice
VAProfileVC1Simple : VAEntrypointVLD
VAProfileVC1Main : VAEntrypointVLD
VAProfileVC1Advanced : VAEntrypointVLD
VAProfileNone : VAEntrypointVideoProc
VAProfileJPEGBaseline : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointEncPicture
VAProfileVP8Version0_3 : VAEntrypointVLD
VAProfileVP8Version0_3 : VAEntrypointEncSlice
VAProfileHEVCMain : VAEntrypointVLD
VAProfileHEVCMain : VAEntrypointEncSlice
VAProfileVP9Profile0 : VAEntrypointVLD

The VLD (for Variable Length Decode) entry point for VP9 profile 0 is as far as Skylake goes in terms of VP9 hardware acceleration: decode only, with no encode entry point listed.
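
Reading that list by eye gets tedious across several machines. Here is a small sketch, in Python, of how the same check could be automated by parsing vainfo's profile/entrypoint lines. The function names are made up for this example, and treating any entrypoint whose name starts with VAEntrypointEnc as "encode is available" is a simplifying assumption rather than something libva guarantees.

```python
from collections import defaultdict


def parse_vainfo(text):
    """Map each VAProfile name to the set of entrypoints listed for it."""
    caps = defaultdict(set)
    for line in text.splitlines():
        profile, sep, entrypoint = line.partition(":")
        profile, entrypoint = profile.strip(), entrypoint.strip()
        # Keep only "VAProfileX : VAEntrypointY" lines; skip the libva banner.
        if sep and profile.startswith("VAProfile") and entrypoint.startswith("VAEntrypoint"):
            caps[profile].add(entrypoint)
    return dict(caps)


def can_encode(caps, profile):
    """True if any encode-style entrypoint is listed for the profile (assumed heuristic)."""
    return any(ep.startswith("VAEntrypointEnc") for ep in caps.get(profile, ()))


# A few lines copied from the Skylake vainfo output above.
SKYLAKE_SAMPLE = """\
libva info: VA-API version 0.40.0
VAProfileH264High : VAEntrypointVLD
VAProfileH264High : VAEntrypointEncSlice
VAProfileVP8Version0_3 : VAEntrypointVLD
VAProfileVP8Version0_3 : VAEntrypointEncSlice
VAProfileVP9Profile0 : VAEntrypointVLD
"""

if __name__ == "__main__":
    caps = parse_vainfo(SKYLAKE_SAMPLE)
    print("VP8 encode:", can_encode(caps, "VAProfileVP8Version0_3"))  # True
    print("VP9 encode:", can_encode(caps, "VAProfileVP9Profile0"))    # False
```

On a live system the text could come from running vainfo through subprocess instead of the pasted sample.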

Those with Kabylake test beds: run these encode tests and report back 🙂
