Sunday, October 22, 2017

Mplayer 0.90

Play some videos

So it is time to test how well this 486 will play MPEG videos. I downloaded the oldest version of Mplayer, configured it with minimal settings (disabling everything that is not needed), compiled it, and tested how well it runs.

First video is without sound.

Second video has sound, reduced to one channel only.


So it can somewhat play video, but anything larger than the last video here will not play with sound. With the -framedrop option it shows a black screen and plays only the audio. Without sound it is really slow.

So it is time to investigate how I can speed this up.
Bad:
  • No hardware YUV
  • No SIMD ( :D)
  • No hardware sound acceleration
Good:
  • Hardware BITBLT

Even without profiling Mplayer, it is clear that the MPEG IDCT takes most of the time.

static inline void idct_row (int16_t * block)
{
    int x0, x1, x2, x3, x4, x5, x6, x7, x8;
    x1 = block[4] << 11;
    x2 = block[6];
    x3 = block[2];
    x4 = block[1];
    x5 = block[7];
    x6 = block[5];
    x7 = block[3];
    /* shortcut */
    if (! (x1 | x2 | x3 | x4 | x5 | x6 | x7 )) {
        block[0] = block[1] = block[2] = block[3] = block[4] =
            block[5] = block[6] = block[7] = block[0]<<3;
        return;
    }
    x0 = (block[0] << 11) + 128; /* for proper rounding in the fourth stage */
    /* first stage */
    x8 = W7 * (x4 + x5);
    x4 = x8 + (W1 - W7) * x4;
    x5 = x8 - (W1 + W7) * x5;
    x8 = W3 * (x6 + x7);
    x6 = x8 - (W3 - W5) * x6;
    x7 = x8 - (W3 + W5) * x7;

    /* second stage */
    x8 = x0 + x1;
    x0 -= x1;
    x1 = W6 * (x3 + x2);
    x2 = x1 - (W2 + W6) * x2;
    x3 = x1 + (W2 - W6) * x3;
    x1 = x4 + x6;
    x4 -= x6;
    x6 = x5 + x7;
    x5 -= x7;

    /* third stage */
    x7 = x8 + x3;
    x8 -= x3;
    x3 = x0 + x2;
    x0 -= x2;
    x2 = (181 * (x4 + x5) + 128) >> 8;
    x4 = (181 * (x4 - x5) + 128) >> 8;

    /* fourth stage */
    block[0] = (x7 + x1) >> 8;
    block[1] = (x3 + x2) >> 8;
    block[2] = (x0 + x4) >> 8;
    block[3] = (x8 + x6) >> 8;
    block[4] = (x8 - x6) >> 8;
    block[5] = (x0 - x4) >> 8;
    block[6] = (x3 - x2) >> 8;
    block[7] = (x7 - x1) >> 8;
}
So, how to speed this up even more? I searched the internet, and apparently it cannot be made any faster, at least on a 486.
Still, there must be something that can be done.
I found this (http://jpegclub.org/jidctred/), which uses a 4x4 or 1x1 transform on an 8x8 block.

Can we do this with an MPEG 8x8 image block? If the DCT coefficients are concentrated in the top-left corner, we can reduce the transform complexity by setting x1, x2, x4, x7 to zero. Compiled with the -O3 option this gives faster code, because gcc optimizes away the code that depends on those values. Do the same with the column filter. Yes, it will reduce video quality.
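The zeroing trick can be sketched in C as below. This is a minimal sketch and not the actual patch: it assumes the dropped inputs are block[4] to block[7] (going by the index mapping in the listing, that is x1, x6, x2 and x5), and the names idct_row_impl, idct_row_full and idct_row_4 are made up for illustration. The W constants are the usual 2048*sqrt(2)*cos(k*pi/16) values from the reference MPEG-2 IDCT, and multiplications by 2048 and 8 stand in for the left shifts. Because drop_high is a constant at each call site, the compiler can fold the zeroed terms away once the helper is inlined.

```c
#include <stdint.h>

/* 2048*sqrt(2)*cos(k*pi/16), as in the reference MPEG-2 IDCT */
#define W1 2841
#define W2 2676
#define W3 2408
#define W5 1609
#define W6 1108
#define W7 565

/* Row pass of the listing above. With drop_high != 0 the inputs read from
   block[4..7] are replaced by zero (assumption: these are the coefficients
   to drop), so constant folding can remove the work that depends on them. */
static inline void idct_row_impl(int16_t *block, const int drop_high)
{
    int x0, x1, x2, x3, x4, x5, x6, x7, x8, i;

    x1 = drop_high ? 0 : block[4] * 2048;
    x2 = drop_high ? 0 : block[6];
    x3 = block[2];
    x4 = block[1];
    x5 = drop_high ? 0 : block[7];
    x6 = drop_high ? 0 : block[5];
    x7 = block[3];

    /* shortcut: only the DC coefficient is left */
    if (!(x1 | x2 | x3 | x4 | x5 | x6 | x7)) {
        int16_t dc = (int16_t)(block[0] * 8);
        for (i = 0; i < 8; i++)
            block[i] = dc;
        return;
    }
    x0 = block[0] * 2048 + 128; /* +128 rounds the final >> 8 */

    /* first stage */
    x8 = W7 * (x4 + x5);
    x4 = x8 + (W1 - W7) * x4;
    x5 = x8 - (W1 + W7) * x5;
    x8 = W3 * (x6 + x7);
    x6 = x8 - (W3 - W5) * x6;
    x7 = x8 - (W3 + W5) * x7;

    /* second stage */
    x8 = x0 + x1;
    x0 -= x1;
    x1 = W6 * (x3 + x2);
    x2 = x1 - (W2 + W6) * x2;
    x3 = x1 + (W2 - W6) * x3;
    x1 = x4 + x6;
    x4 -= x6;
    x6 = x5 + x7;
    x5 -= x7;

    /* third stage */
    x7 = x8 + x3;
    x8 -= x3;
    x3 = x0 + x2;
    x0 -= x2;
    x2 = (181 * (x4 + x5) + 128) >> 8;
    x4 = (181 * (x4 - x5) + 128) >> 8;

    /* fourth stage */
    block[0] = (int16_t)((x7 + x1) >> 8);
    block[1] = (int16_t)((x3 + x2) >> 8);
    block[2] = (int16_t)((x0 + x4) >> 8);
    block[3] = (int16_t)((x8 + x6) >> 8);
    block[4] = (int16_t)((x8 - x6) >> 8);
    block[5] = (int16_t)((x0 - x4) >> 8);
    block[6] = (int16_t)((x3 - x2) >> 8);
    block[7] = (int16_t)((x7 - x1) >> 8);
}

/* Full 8-coefficient row pass. */
void idct_row_full(int16_t *block) { idct_row_impl(block, 0); }

/* Reduced row pass that ignores whatever is stored in block[4..7]. */
void idct_row_4(int16_t *block) { idct_row_impl(block, 1); }
```

When block[4..7] really are zero, idct_row_4 gives the same result as idct_row_full; when they are not, their contribution is simply lost, which is where the quality drop comes from.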

Success. It plays better. 

I also tested with only a 2x2 block, but the video quality was not really good: the last video was blocky, while the first video played smoothly.

Another problem is the code gcc generates for the 486. According to AMD's and Intel's documents their CPUs are pipelined, and the code gcc produces with -O3 does not suit that pipeline very well. There are probably AGI stalls in lots of places (https://www.gamedev.net/articles/programming/general-and-gameplay-programming/optimize-386486pentium-code-r206/). Every clock tick counts.
If you compile with -march=i486 or -march=i386 you get identical code.
So I tested with i586. The code is arranged differently. I made a test program with this idct, and with i586 it was better: 18.8 sec vs 18.06 sec.
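A test program of this kind can be as small as a timing loop around the function. The sketch below is an assumption about its shape, not the program used for the numbers above: time_kernel and scale_row are hypothetical names, and scale_row is only a stand-in so the file compiles by itself. The idct_row under test would take its place.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef void (*row_kernel)(int16_t *block);

/* Hypothetical stand-in kernel so the harness compiles on its own;
   replace it with the idct_row variant being measured. */
void scale_row(int16_t *block)
{
    int i;
    for (i = 0; i < 8; i++)
        block[i] = (int16_t)(block[i] * 2);
}

/* Runs `kernel` on a fresh copy of `input` `iterations` times and returns
   the CPU time used, in seconds, as reported by clock(). */
double time_kernel(row_kernel kernel, const int16_t input[8], long iterations)
{
    volatile int16_t sink = 0; /* keeps the optimizer from dropping the loop */
    int16_t block[8];
    long n;
    clock_t start = clock();

    for (n = 0; n < iterations; n++) {
        memcpy(block, input, sizeof block);
        kernel(block);
        sink = (int16_t)(sink ^ block[0]);
    }
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Building the same file once with gcc -O3 -march=i486 and once with gcc -O3 -march=i586, then comparing the returned times, gives the kind of comparison described above. Calling through a function pointer prevents inlining, so the numbers include call overhead.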

The last thing is to profile. The result shows that another CPU eater is yuv2rgb_; maybe a newer version of Mplayer has a better conversion?
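To see why the conversion is expensive, here is a sketch of the per-pixel arithmetic that a YUV to RGB conversion involves. It is not Mplayer's yuv2rgb code: it assumes BT.601 video-range input and the common 8-bit fixed-point coefficients (298, 409, 100, 208, 516), and yuv_to_rgb_pixel and scale_clamp are hypothetical names.

```c
#include <stdint.h>

/* Drops the 8 fractional bits and clamps the result to 0..255. */
static uint8_t scale_clamp(int v)
{
    if (v < 0)
        return 0;
    v >>= 8;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Converts one pixel, assuming BT.601 video range (Y 16..235, U/V 16..240).
   rgb[0] = R, rgb[1] = G, rgb[2] = B. */
void yuv_to_rgb_pixel(uint8_t y, uint8_t u, uint8_t v, uint8_t rgb[3])
{
    int c = y - 16;   /* luma without its offset */
    int d = u - 128;  /* blue-difference chroma, centered on zero */
    int e = v - 128;  /* red-difference chroma, centered on zero */

    rgb[0] = scale_clamp(298 * c + 409 * e + 128);
    rgb[1] = scale_clamp(298 * c - 100 * d - 208 * e + 128);
    rgb[2] = scale_clamp(298 * c + 516 * d + 128);
}
```

Written this way, every pixel costs several multiplications plus three clamps, which adds up quickly on a CPU without SIMD.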

Sound

Mplayer uses mpg123 for MPEG audio decoding. It uses an old version, 0.50. I need to replace this.

