Archive for April, 2010

Adding evergreen support part 1

Wednesday, April 28th, 2010

As promised, the following details, on a patch by patch basis, how support for a new asic is added to the radeon drm.  This post covers the patches in http://people.freedesktop.org/~agd5f/evergreen_kms/set1/ which add basic modesetting support for evergreen asics.

0001-drm-radeon-kms-whitespace-changes-to-atombios.h.patch
0002-drm-radeon-kms-pull-in-the-latest-upstream-atombios.patch

0001 is just a cleanup of atombios.h to make merging upstream changes easier.  0002 brings in the atombios.h changes needed for evergreen asics.  The atombios rom on evergreen asics has updated data and command tables and this patch adds the definitions needed for properly using those tables.

0003-drm-radeon-kms-evergreen-add-chip-enums.patch

0003 adds the chip family enums used in the driver.  The chip family is set based on the pci id and is used to differentiate between different asic families in the driver.  All evergreen chips have a common display block (called DCE version 4).  This patch also adds a convenience macro, ASIC_IS_DCE4.  This is mainly used in the modesetting code to select the appropriate code paths for evergreen chips.
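To make the family check concrete, here is a minimal sketch of how an ordered family enum and a macro like ASIC_IS_DCE4 can work together.  The enum below is trimmed for illustration and is not the driver's full list, the struct is reduced to one field, and radeon_num_crtcs() is a hypothetical helper that simply follows the crtc counts given in this post.

```c
/* Trimmed, illustrative family list.  Ordering matters: newer families
 * come later, so a range check selects "this family or newer". */
enum radeon_family {
    CHIP_R100,
    CHIP_R300,
    CHIP_R600,
    CHIP_RV770,
    CHIP_CEDAR,     /* first evergreen entry in this sketch */
    CHIP_REDWOOD,
    CHIP_JUNIPER,
    CHIP_CYPRESS,
    CHIP_HEMLOCK
};

/* Reduced stand-in for the driver's device struct. */
struct radeon_device {
    enum radeon_family family;  /* chosen from the PCI id at load time */
};

/* All evergreen chips share the DCE4 display block. */
#define ASIC_IS_DCE4(rdev) ((rdev)->family >= CHIP_CEDAR)

/* Hypothetical helper: pick the crtc count the way modesetting code
 * would branch on the macro (6 on evergreen, 2 before it). */
int radeon_num_crtcs(const struct radeon_device *rdev)
{
    return ASIC_IS_DCE4(rdev) ? 6 : 2;
}
```

With this layout, adding a newer family after the evergreen entries automatically takes the DCE4 code paths unless a later macro says otherwise.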

0004-drm-radeon-kms-evergreen-add-modesetting-register-a.patch

0004 adds the modesetting register definitions that will be used in later patches.

0005-drm-radeon-kms-evergreen-add-initial-support.patch

0005 is the initial evergreen support patch.  It adds a new file, evergreen.c, that has the evergreen specific asic functions defined in it.  At this point all that is implemented is support for setting up the MC (memory controller) and setting the proper number of crtcs (6 rather than 2 as on previous asics).  There is no modesetting support yet.

0006-drm-radeon-kms-evergreen-don-t-enable-vblanks-yet.patch

0006 disables the vblank support functions in the crtc dpms code for evergreen asics (note the use of the ASIC_IS_DCE4 macro).  These are disabled for now as the driver does not implement interrupt support at this point (interrupt support is added in patch set2, I’ll go into that in a future post).  Leaving them enabled without support for vblank interrupts would result in a potential hang when that code is called.

0007-drm-radeon-kms-evergreen-add-LUT-support.patch

0007 adds support for the evergreen crtc color lookup tables.  These tables are used to adjust the colors sent from the crtc to the monitor.  They are the mechanism used to implement support for gamma correction using xgamma or xrandr.

0008-drm-radeon-kms-evergreen-add-hw-cursor-support.patch

0008 adds support for evergreen hardware cursors.  Hardware cursors are a special image that gets blended into the data stream sent from the crtc to the monitor.  Hardware cursors are not actually drawn on the framebuffer, rather they are drawn to a separate buffer and blended into the framebuffer data stream at the position specified.

0009-drm-radeon-kms-evergreen-add-crtc_set_base-function.patch

0009 adds support for setting the non-timing related aspects of the crtc: base address, x/y offset, pitch, and data format. The base address specifies the location in vram where the crtc will be reading data out of to be sent to the monitor.  For multi-head, each crtc points to a different location in vram; for clone modes, each crtc points to the same location.  The data format defines the number of bits per pixel and the ordering of the channels (32 bits per pixel, 8 bits per channel, ARGB ordering for example).  The pitch is the width of the buffer the crtc is scanning out of.  Pitch is used to tell the crtc where the next scanline begins.  Finally the x/y offset is used to implement things like virtual desktops (i.e., the mode on your monitor is like a window looking into a larger desktop; when you move the cursor to the edge of the window, the window moves to show the rest of the desktop).  It’s basically an offset that’s added to the base address so that the crtc will scan out a slightly different location in vram.
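The arithmetic behind base address, pitch, and x/y offset can be sketched as follows.  This is only the address math, under the assumption that pitch is given in pixels; the struct and function names are made up for illustration, and the hardware takes these values through its own registers rather than a single computed address.

```c
#include <stdint.h>

/* Illustrative description of what a crtc scans out. */
struct scanout {
    uint64_t base;   /* start of the buffer in vram, in bytes */
    uint32_t pitch;  /* width of the buffer, in pixels */
    uint32_t bpp;    /* bits per pixel, e.g. 32 for ARGB 8:8:8:8 */
    uint32_t x, y;   /* offset of the visible window, in pixels */
};

/* Address of the first pixel the crtc fetches: the x/y offset is
 * effectively added to the base address. */
uint64_t scanout_start(const struct scanout *s)
{
    uint32_t bytes_per_pixel = s->bpp / 8;

    return s->base + ((uint64_t)s->y * s->pitch + s->x) * bytes_per_pixel;
}

/* Address where a given visible scanline begins: the pitch tells the
 * crtc how far to step to reach the next line. */
uint64_t scanline_addr(const struct scanout *s, uint32_t line)
{
    return scanout_start(s) + (uint64_t)line * s->pitch * (s->bpp / 8);
}
```

For a virtual desktop, only x and y change as the window pans; base, pitch, and format stay the same.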

Additionally this patch selects the proper crtc timing command table to use.  The crtc timing and scaler setup command table parameter structures did not have any major changes for evergreen, so the existing code could be used for them.

0010-drm-radeon-kms-evergreen-add-support-for-digital-ou.patch

0010 adds atombios support for digital encoders (TMDS, HDMI, LVDS, DisplayPort).  The digital encoder and transmitter control table parameters changed to accommodate changes in the hardware.  The new parameter formats were added in atombios.h updates in patch 0002.  This patch sets up the new parameters so that the digital outputs can be properly enabled and disabled.  The analog encoder table parameters did not change for evergreen, so no changes were needed there.

0011-drm-radeon-kms-evergreen-add-support-for-atom-Adjus.patch

The PLLs used for the crtc and transmitter clocks need to be adjusted in some cases.  This can include making changes to the requested clock itself, or specifying certain restrictions when selecting the PLL dividers.  0011 adds evergreen support to the PLL adjust function.

0012-drm-radeon-kms-evergreen-add-support-for-pll-setup.patch

0012 adds support for actually programming the display PLLs.  As I mentioned previously, the PLL setup is fairly complex on evergreen in order to support up to 6 independent monitors.  This patch adds support for changes in the SetPixelClock command table parameters as well as general infrastructure changes to handle PLL allocation properly.

0013-drm-radeon-kms-add-evergreen-pci-ids.patch

Finally 0013 adds the new evergreen PCI ids. With the ids added, the driver will now load on evergreen hardware.

That covers basic modesetting support.  The next set, http://people.freedesktop.org/~agd5f/evergreen_kms/set2/ focuses on the infrastructure needed to support acceleration.  I’ll cover that in a future post.

Understanding GPUs from the ground up

Thursday, April 15th, 2010

I get asked a lot about learning how to program GPUs.  Bringing up evergreen kms support seems like a good place to start, so I figured I’d write a series of articles detailing the process based on the actual evergreen patches.  First, to get a better understanding of how GPUs work, take a look at the radeon drm.  This article assumes a basic understanding of C and computer architectures.  The basic process is that the driver loads, initializes the hardware, sets up non-hw specific things like the memory manager, and sets up the displays.  This first article describes the basic driver flow when the drm loads in kms mode.

radeon_driver_load_kms() (in radeon_kms.c) is where everything starts.  It calls radeon_device_init() to initialize the non-display hardware and radeon_modeset_init() (in radeon_display.c) to initialize the display hardware.

The main workhorse of the driver initialization is radeon_device_init() found in radeon_device.c.  First we initialize a bunch of the structs used in the driver.  Then radeon_asic_init() is called. This function sets up the asic specific function pointers for various things such as suspend/resume callbacks, asic reset, set/process irqs, set/get engine clocks, etc.  The common code then uses these callbacks to call the asic specific code to achieve the requested functionality.  For example, enabling and processing interrupts works differently on a RV100 vs. a RV770.  Since functionality changes in stages, some routines are used for multiple asic families.  This lets us mix and match the appropriate functions for the specifics of how the chip is programmed.  For example, R1xx and R3xx chips both use the same interrupt scheme (as defined in r100_irq_set()/r100_irq_process()), but they have different initialization routines (r100_init() vs. r300_init()).
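The mix-and-match idea can be sketched with a small table of function pointers.  The structs here are heavily reduced and the function bodies are placeholders that only record which routine ran; only the function names and the shape of the dispatch follow the description above.

```c
struct radeon_device;

/* Reduced callback table: one entry per asic specific operation. */
struct radeon_asic {
    int (*init)(struct radeon_device *rdev);
    int (*irq_set)(struct radeon_device *rdev);
};

struct radeon_device {
    const struct radeon_asic *asic;
    int init_ran;     /* illustration only: which init routine ran */
    int irq_set_ran;  /* illustration only: which irq routine ran */
};

/* Placeholder bodies standing in for the real asic code. */
int r100_init(struct radeon_device *rdev)    { rdev->init_ran = 100; return 0; }
int r300_init(struct radeon_device *rdev)    { rdev->init_ran = 300; return 0; }
int r100_irq_set(struct radeon_device *rdev) { rdev->irq_set_ran = 100; return 0; }

/* R3xx reuses the R1xx interrupt scheme but has its own init. */
const struct radeon_asic r100_asic = { .init = r100_init, .irq_set = r100_irq_set };
const struct radeon_asic r300_asic = { .init = r300_init, .irq_set = r100_irq_set };

/* Common code only ever goes through the table. */
#define radeon_init(rdev)    ((rdev)->asic->init((rdev)))
#define radeon_irq_set(rdev) ((rdev)->asic->irq_set((rdev)))
```

Supporting a new family then mostly means filling in a new table, reusing whichever existing routines still apply.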

Next we set up the DMA masks for the driver.  These let the kernel know what size address space the card is able to address.  In the case of radeons, it’s used for GPU access to graphics buffers stored in system memory which are accessed via a GART (Graphics Address Remapping Table).  AGP and the older on-chip GART mechanisms are limited to 32 bits.  Newer on-chip GART mechanisms have larger address spaces.

After DMA masks, we set up the MMIO aperture.  PCI/PCIE/AGP devices are programmed via apertures called BARs (Base Address Registers).  These apertures provide access to resources on the card such as registers, framebuffers, and roms.  GPUs are configured via registers; if you want to access those registers, you’d map the register BAR.  If you want to write to the framebuffer (some of which may be displayed on your screen), you would map the framebuffer BAR.  In this case we map the register BAR; this register mapping is then used by the driver to configure the card.
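Once the register BAR is mapped, register access is just loads and stores at byte offsets into that mapping.  The sketch below assumes the mapping is already available as a pointer; in userspace an ordinary array can stand in for it.  The accessor names are hypothetical, and a real kernel driver would use the kernel's I/O mapping and accessor helpers rather than plain pointer dereferences.

```c
#include <stdint.h>

/* Reduced device struct: just the register mapping. */
struct radeon_device {
    volatile uint32_t *rmmio;   /* CPU mapping of the register BAR */
    uint32_t rmmio_size;        /* size of the mapping, in bytes */
};

/* Read a 32-bit register at the given byte offset into the BAR. */
uint32_t reg_read32(struct radeon_device *rdev, uint32_t reg)
{
    return rdev->rmmio[reg / 4];
}

/* Write a 32-bit register at the given byte offset into the BAR. */
void reg_write32(struct radeon_device *rdev, uint32_t reg, uint32_t val)
{
    rdev->rmmio[reg / 4] = val;
}
```

The volatile qualifier keeps the compiler from caching or dropping accesses, since each read or write is meant to reach the hardware.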

vga_client_register() comes next, and is beyond the scope of this article.  It’s basically a way to work around the limitations of VGA on PCI buses with multiple VGA devices.

Next up is radeon_init().  This is actually a macro defined in radeon.h that references the asic init callback we initialized in  radeon_asic_init() several steps ago.  The asic specific init function is called.  For an RV100, it would be r100_init() defined in r100.c, for RV770, it’s rv770_init().

That’s pretty much it for radeon_device_init().  Next let’s look at what happens in the asic specific init functions.  They all follow the same pattern, although some asics may do more or less depending on the functionality.  Let’s take a look at r100_init() in r100.c.  First we initialize debugfs; this is a kernel debugging framework and outside the scope of this article.  Next we call r100_vga_render_disable(), which disables the VGA engine on the card.  The VGA engine provides VGA compatibility; since we are going to be programming the card directly, we disable it.

Following that, we set up the GPU scratch registers (radeon_scratch_init() defined in radeon_device.c).  These are scratch registers used by the CP (Command Processor) to signal graphics events.  In general they are used for what we call fences.  A write to one of these scratch registers can be added to the command stream sent to the GPU.  When it encounters that command, it writes the value specified to that scratch register.  The driver can then check the value of the scratch register to determine whether that fence has come up or not.  For example, if you want to know if the GPU is done rendering to a buffer, you’d insert a fence after the rendering commands.  You can then check the scratch register to determine if that fence has passed (and hence the rendering is done).
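The fence mechanism can be modeled in a few lines.  This sketch assumes fences are numbered with an increasing sequence value and simulates the GPU side with a plain struct; all names are hypothetical and nothing here touches real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the GPU: one scratch register. */
struct fake_gpu {
    uint32_t scratch;
};

/* Driver-side bookkeeping: the last sequence number handed out. */
struct fence_driver {
    uint32_t last_emitted;
};

struct fence {
    uint32_t seq;
};

/* Driver side: allocate the next fence.  A real driver would also add
 * a "write seq to the scratch register" command to the command stream. */
struct fence fence_emit(struct fence_driver *fd)
{
    struct fence f = { ++fd->last_emitted };
    return f;
}

/* GPU side: the CP reaches the fence command and writes its value. */
void gpu_reach_fence(struct fake_gpu *gpu, struct fence f)
{
    gpu->scratch = f.seq;
}

/* Driver side: a fence has passed once the scratch register holds its
 * value or a later one.  The signed difference tolerates wraparound. */
bool fence_signaled(const struct fake_gpu *gpu, struct fence f)
{
    return (int32_t)(gpu->scratch - f.seq) >= 0;
}
```

Because commands execute in order, a passed fence implies that everything queued before it has finished as well.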

radeon_get_bios() loads the video bios from the PCI ROM BAR.  The video bios contains data and command tables.  The data tables define things like the number and type of connectors on the card and how those connectors are mapped to encoders, the GPIO registers and bitfields used for DDC and other i2c buses, LVDS panel information for laptops, display and engine PLL limits, etc.  The command tables are used for initializing the hardware (normally done by the system bios during post, but required for things like suspend/resume and initializing secondary cards), and on systems with ATOM bios the command tables are used for setting up the displays and changing things like engine and memory clocks.

Next, we initialize the bios scratch registers (radeon_combios_initialize_bios_scratch_regs() via radeon_combios_init()).  These registers are a way for the firmware on the system to communicate state to the graphics driver.  They contain things like connected outputs, whether the driver or the firmware will handle things like lid or mode change events, etc.

radeon_boot_test_post_card() checks to see whether the system bios has posted the card or not.  This is used to determine whether the card needs to be initialized by the driver using the bios command tables or if the system bios has already done it.

radeon_get_clock_info() gets the PLL (Phase Locked Loop, used to generate clocks) information from the bios tables.  This includes the display PLLs, engine and memory PLLs and the reference clock that the PLLs use to generate their final clocks.

radeon_pm_init() initializes the power management features of the chip.

Next the MC (Memory Controller) is initialized (r100_mc_init()).  The GPU has its own address space similar to the CPU.  Within that address space you map VRAM and GART.  The blocks on the chip (2D, 3D engines, display controllers, etc.) access these resources via the GPU’s address space.  VRAM is mapped at one offset and GART at another.  If you want to read from a texture located in GART memory, you’d point the texture base address at some offset in the GART aperture in the GPU’s address space.  If you want to display a buffer in VRAM on your monitor, you’d point one of your crtc base addresses to an address in the VRAM aperture in the GPU’s address space.  The MC init function determines how much VRAM is on the card and where to place VRAM and GART in the GPU’s address space.
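A sketch of the layout decision: given the VRAM and GART sizes, pick non-overlapping ranges in the GPU's address space.  The placement policy here (VRAM at 0, GART at the next size-aligned address) is an assumption chosen for simplicity, not the policy of any particular asic, and the names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Where VRAM and GART live in the GPU's address space. */
struct mc_layout {
    uint64_t vram_start, vram_size;
    uint64_t gtt_start, gtt_size;
};

/* Assumed policy: VRAM at address 0, GART at the next address that is
 * a multiple of its own size.  gtt_size must be a power of two. */
void mc_place(struct mc_layout *mc, uint64_t vram_size, uint64_t gtt_size)
{
    mc->vram_start = 0;
    mc->vram_size = vram_size;
    mc->gtt_size = gtt_size;
    mc->gtt_start = (mc->vram_start + vram_size + gtt_size - 1) & ~(gtt_size - 1);
}

/* Does a GPU address fall in the VRAM aperture? */
bool mc_in_vram(const struct mc_layout *mc, uint64_t addr)
{
    return addr >= mc->vram_start && addr - mc->vram_start < mc->vram_size;
}

/* Does a GPU address fall in the GART aperture? */
bool mc_in_gtt(const struct mc_layout *mc, uint64_t addr)
{
    return addr >= mc->gtt_start && addr - mc->gtt_start < mc->gtt_size;
}
```

A crtc base address would then be some offset from vram_start, and a texture in system memory would be addressed as an offset from gtt_start.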

radeon_fence_driver_init() initializes the common code used for fences.  See above for more on fences.

radeon_irq_kms_init() initializes the common code used for irqs.

radeon_bo_init() initializes the memory manager.

r100_pci_gart_init() sets up the on board GART mechanism and radeon_agp_init() initializes AGP GART.  This allows the GPU to access buffers in system memory.  Since system memory is paged, large allocations are not contiguous.  The GART provides a way to make many disparate pages look like one contiguous block by using address remapping.  With AGP, the northbridge provides the address remapping, and you just point the GPU’s AGP aperture at the one provided by the northbridge.  The on-board GART provides the same functionality for non-AGP systems (PCI or PCIE).
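The remapping itself is a table lookup.  The sketch below models a tiny GART with 4 KiB pages; the table size, page size, and names are assumptions for illustration, and the real table format is hardware specific.

```c
#include <stddef.h>
#include <stdint.h>

#define GART_PAGE_SIZE 4096u
#define GART_NUM_PAGES 8u     /* tiny table, for illustration */

struct gart {
    uint64_t aperture_start;         /* GART start in GPU address space */
    uint64_t pages[GART_NUM_PAGES];  /* system address of each page */
};

/* Bind n scattered system pages to consecutive GART slots. */
int gart_bind(struct gart *g, size_t first, const uint64_t *sys_pages, size_t n)
{
    if (first > GART_NUM_PAGES || n > GART_NUM_PAGES - first)
        return -1;
    for (size_t i = 0; i < n; i++)
        g->pages[first + i] = sys_pages[i];
    return 0;
}

/* Translate a GPU address inside the aperture to a system address.
 * The caller must pass an address within the aperture. */
uint64_t gart_translate(const struct gart *g, uint64_t gpu_addr)
{
    uint64_t off = gpu_addr - g->aperture_start;

    return g->pages[off / GART_PAGE_SIZE] + off % GART_PAGE_SIZE;
}
```

From the GPU's point of view the bound pages form one contiguous range, even though the system pages behind them are scattered.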

Next up we have  r100_set_safe_registers().  This function sets the list of registers that command buffers from userspace are allowed to access.  When a userspace driver like the ddx (2D) or mesa (3D) sends commands to the GPU, the drm checks those command buffers to prevent access to unauthorized registers or memory.
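One simple way to picture the register check is a bitmap with one bit per register.  This is a sketch under that assumption, with a made-up register space of 64 registers and hypothetical names; the real checker also has to parse command packets and validate memory references.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_REGS 64u   /* tiny register space, for illustration */

struct safe_regs {
    uint32_t allowed[NUM_REGS / 32];  /* one bit per 32-bit register */
};

/* Mark the register at the given byte offset as safe for userspace. */
void safe_regs_allow(struct safe_regs *s, uint32_t reg_offset)
{
    uint32_t idx = reg_offset / 4;    /* byte offset -> register index */

    if (idx < NUM_REGS)
        s->allowed[idx / 32] |= 1u << (idx % 32);
}

/* Is a single register write permitted? */
bool reg_is_safe(const struct safe_regs *s, uint32_t reg_offset)
{
    uint32_t idx = reg_offset / 4;

    if (idx >= NUM_REGS)
        return false;
    return (s->allowed[idx / 32] >> (idx % 32)) & 1u;
}

/* Reject the whole command buffer if any register in it is not safe. */
bool cmds_are_safe(const struct safe_regs *s, const uint32_t *regs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!reg_is_safe(s, regs[i]))
            return false;
    return true;
}
```

Rejecting the whole buffer on the first bad register keeps a partially checked buffer from ever reaching the hardware.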

Finally, r100_startup() programs the hardware with everything set up in r100_init().  It’s a separate function since it’s also called when resuming from suspend as the current hardware configuration needs to be restored in that case as well.  The VRAM and GART setup is programmed in r100_mc_program() and r100_pci_gart_enable(); irqs are set up in r100_irq_set().

r100_cp_init() initializes the CP and sets up the ring buffer.  The CP is the part of the chip that feeds acceleration commands to the GPU.  It’s fed by a ring buffer that the driver (CPU) writes to and the GPU reads from.  Besides commands, you can also write pointers to command buffers stored elsewhere in the GPU’s address space (called an indirect buffer).  For example, the 3D driver might send a command buffer to the drm; after checking it, the drm would put a pointer to that command buffer on the ring, followed by a fence.  When the CP gets to the pointer in the ring, it fetches the command buffer and processes the commands in it, then returns to where it left off in the ring.  Buffers referenced by the command buffer are “locked” until the fence passes since the GPU is accessing them in the execution of those commands.
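The producer/consumer relationship on the ring can be sketched with a write pointer and a read pointer.  This is a generic single-producer ring in plain C with hypothetical names; on real hardware the read pointer is advanced by the CP and the pointers live in registers.

```c
#include <stdint.h>

#define RING_SIZE 16u   /* in dwords; must be a power of two */

struct ring {
    uint32_t buf[RING_SIZE];
    uint32_t wptr;   /* advanced by the driver (CPU) */
    uint32_t rptr;   /* advanced by the CP (GPU) */
};

/* Free dwords.  One slot stays empty so "full" and "empty" differ. */
uint32_t ring_free(const struct ring *r)
{
    return (r->rptr - r->wptr - 1) & (RING_SIZE - 1);
}

/* Driver side: queue one dword (a command or an indirect buffer pointer). */
int ring_write(struct ring *r, uint32_t v)
{
    if (ring_free(r) == 0)
        return -1;                        /* ring full, try again later */
    r->buf[r->wptr] = v;
    r->wptr = (r->wptr + 1) & (RING_SIZE - 1);
    return 0;
}

/* CP side (simulated): fetch the next dword, if any. */
int ring_read(struct ring *r, uint32_t *v)
{
    if (r->rptr == r->wptr)
        return -1;                        /* ring empty */
    *v = r->buf[r->rptr];
    r->rptr = (r->rptr + 1) & (RING_SIZE - 1);
    return 0;
}
```

Putting only a pointer to an indirect buffer on the ring keeps the ring small while still letting userspace submit large command buffers.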

r100_wb_init() initializes scratch register writeback which is a feature that lets the GPU update copies of the scratch registers in GART memory.  This allows the driver (running on the CPU) to access the content of those registers without having to read them from the MMIO register aperture which requires a trip across the bus.

r100_ib_init() initializes the indirect buffers used for feeding command buffers to the CP from userspace drivers like the 3D driver.

The display side is set up in  radeon_modeset_init().  First we set up the display limits and mode callbacks, then we set up the output properties (radeon_modeset_create_props()) that are exposed via xrandr properties when X is running.

Next, we initialize the crtcs in radeon_crtc_init().  crtcs (also called display controllers) are the blocks on the chip that provide the display timing and determine where in the framebuffer a particular monitor points to.  A crtc provides an independent “head.”  Most radeon asics have two crtcs; the new evergreen chips have six.

radeon_setup_enc_conn() sets up the connector and encoder mappings based on video bios data tables.  Encoders are things like DACs for analog outputs like VGA and TV, and TMDS or LVDS encoders for things like digital DVI or LVDS panels.  An encoder can be tied to one or more connectors (e.g., the TV DAC is often tied to both the S-video and a VGA port or the analog portion of a DVI-I port).  The mapping is important as you need to know what encoders are in use and what they are tied to in order to program the displays properly.

radeon_hpd_init() is a macro that points to the asic specific function that initializes the HPD (Hot Plug Detect) hardware for digital monitors. HPD allows you to get an interrupt when a digital monitor is connected or disconnected.  When this happens the driver will take appropriate action and generate an event which userspace apps can listen for.  The app can then display a message asking the user what they want to do, etc.

Finally,  radeon_fbdev_init() sets up the drm kernel fb interface.  This provides a kernel fb interface on top of the drm for the console or other kernel fb apps.

When the driver is unloaded the whole process happens in reverse; this time all the *_fini() functions are called to tear down the driver.

The next set of articles will walk through the evergreen patches available here which have already been applied upstream and explain what each patch does to bring up support for evergreen chips.

Notes about radeon display hardware

Thursday, April 15th, 2010

Display routing can be confusing so here are the basics.  The simplified route from framebuffer to monitor looks like this:

framebuffer -> crtc -> encoder -> transmitter -> connector -> monitor

The framebuffer is just a buffer in vram that has an image encoded in it as an array of pixels.

The crtc reads the data out of the framebuffer and generates the video mode timing in conjunction with a PLL.  The crtc also determines what part of the framebuffer is read; e.g., when multi-head is enabled, each crtc scans out of a different part of vram; in clone mode, each crtc scans out of the same part of vram.

The encoder takes the digital bitstream from the crtc and converts it to the appropriate format for the requested output.

The transmitter takes the digital representation and converts that to the appropriate analog levels for transmission across the connector to the monitor.

The connector provides the appropriate plug for the monitor to connect to.

Radeon Hardware:

On older asics (radeon (r1xx-r4xx), DCE 1.x (r5xx) and DCE 2.x (early R6xx/RS600/RS690/RS740)) the encoders and transmitters tended to be one combined block that supported a single output type; e.g., a TMDS/HDMI block or an LVDS block.  The DCE1.x/2.x LVTMA block was kind of an exception in that it could support both TMDS and LVDS signaling, but the encoder part and the transmitter part were not routeable.  The LVTMA encoder was hardwired to the LVTMA transmitter; the TMDSA encoder was hardwired to the TMDSA transmitter, etc.  Analog outputs (DACs) are also generally one block rather than being split into a separate routeable encoder and transmitter since there are fewer types of analog outputs (pretty much just TV and VGA).  The digital outputs were split into separate encoders and transmitters because there are lots of different types of digital outputs required for systems (HDMI, DP, DVI, LVDS, eDP, etc.).  The links (A,B,A+B) are required for things like dual link DVI or LVDS, where one link doesn’t provide enough for a particular mode.  In that case, two links are used rather than one to transmit the data to the monitor.  The LVTMA block on DCE3.0 was basically a UNIPHY block that also supported LVDS; DCE3.0 UNIPHY only supported TMDS/HDMI/DP.  On DCE3.2, all the UNIPHY blocks supported all output types (LVDS/TMDS/HDMI/DP/etc.), so there was no need to make a distinction.

One or more transmitters are wired to a connector.  Having support for six transmitters means we can support up to six digital connectors (three transmitters, UNIPHY0/1/2, with two links each, A and B).  So physically your system might look like:
1. DACB + UNIPHY0 links A,B  -> dual link DVI-I port
2. UNIPHY1 link B -> single link DVI-D port
3. UNIPHY1 link A -> HDMI type A port
4. UNIPHY2 link A -> single link LVDS port
5. DACB -> TV port
6. DACA -> VGA port

You need to drive those connectors with timing (crtc), the proper digital or analog data stream encoding (DIG encoder or DAC), and appropriate transmission levels (UNIPHY/LVTMA/DAC).  So the logical path would look like:
timing -> encoder -> transmitter -> connector
So for example 2 above the path might look like:
crtc 0 -> DIG1 -> UNIPHY1 link B -> single link DVI-D port
And example 4 might look like:
crtc 0 -> DIG2 -> UNIPHY2 link A -> LVDS port
And example 6 might look like:
crtc 1 -> DACA -> DACA -> VGA port
Until evergreen, radeon chips only supported two crtcs, so there were only two encoder blocks.  With evergreen, there are six crtcs, so there are also six digital encoder blocks since you might want to run six independent displays.  In a way evergreen has gone back to being more like the earlier DCE 1.x/2.x designs in that the encoders are not individually routeable anymore; they are hardwired to a particular transmitter.  E.g., on evergreen DIG0/1 are hardcoded to UNIPHY0.  There are two encoders since UNIPHY links A and B can be used independently or combined (for dual link).  In the combined case, you’d only use one encoder, but it would drive both links. On DCE3.x, the encoders and transmitters are separately routeable since you have more transmitters (three transmitters, six possible links) than encoders (two) and you need to be able to drive different combinations.

Evergreen Hardware:

The number of active heads supported depends on what connectors the OEMs put on their boards. Generally, most seem to be one DP and several non-DP outputs for a total of three possible independent screens, but you could in theory design a board with two or more DP outputs and some combination of non-DP ports for between two and six independent screens. However, as you add more possible simultaneous screens, you need more memory bandwidth, so that needs to be taken into account when designing the board. A lot of current boards have two dual-link DVI ports, an HDMI port, and a DP port. That combination uses all the possible encoders/transmitters, so you’d have to give up one of those to add another DP port.

The evergreen hardware has two PLLs, six crtcs, two DACs, and six digital encoders/transmitters (which can be used for LVDS/TMDS/DP/eDP/HDMI). DP runs at a fixed clock, so you don’t need a separate programmable PLL for it. That gives you some combination of up to two non-DP outputs, and up to six DP outputs for a maximum of six possible independent screens. Dual-link DVI ports require two digital transmitters (one for each link), so a dual-link DVI port would use two of the six possible transmitters, leaving four for other digital outputs. Two dual-link DVI ports would use four digital transmitters which would leave two for other digital outputs.  DP only requires 1 transmitter. So in order to use a dual-link DVI monitor on a DP port, you need an active converter since native dual-link is not possible due to the lack of a second transmitter when running in DVI pass-through mode.  You can use a passive DP->DVI converter on any DP port; all of the DP ports support pass-through and can be configured for HDMI or DVI. However, you are limited to two active non-DP monitors (due to there only being two PLLs) at a time. Also the monitors being used for pass-through (passive converter) have to be single link DVI since DP only has one digital transmitter connected to it.  For more than 2 non-DP monitors, or dual-link DVI, you will need an active converter.
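The resource counting above can be written down as a small check.  This sketch uses only the numbers from this post (six crtcs, two PLLs, two DACs, six digital transmitters) and makes a simplifying assumption that every active non-DP output needs its own PLL; the names are hypothetical and active converters are not modeled.

```c
#include <stdbool.h>
#include <stddef.h>

enum output_type { OUT_DP, OUT_HDMI, OUT_DVI_SL, OUT_DVI_DL, OUT_LVDS, OUT_VGA };

/* Resource counts as given in the text above. */
enum { EG_NUM_CRTCS = 6, EG_NUM_PLLS = 2, EG_NUM_DACS = 2, EG_NUM_TRANSMITTERS = 6 };

/* Digital transmitters consumed by one output. */
int transmitters_needed(enum output_type t)
{
    switch (t) {
    case OUT_VGA:    return 0;  /* analog, driven by a DAC instead */
    case OUT_DVI_DL: return 2;  /* one transmitter per link */
    default:         return 1;
    }
}

/* Can this set of simultaneously active outputs be driven?
 * Assumption: each non-DP output takes one PLL; DP takes none. */
bool evergreen_can_drive(const enum output_type *active, size_t n)
{
    int plls = 0, dacs = 0, tx = 0;

    if (n > (size_t)EG_NUM_CRTCS)
        return false;
    for (size_t i = 0; i < n; i++) {
        if (active[i] != OUT_DP)
            plls++;
        if (active[i] == OUT_VGA)
            dacs++;
        tx += transmitters_needed(active[i]);
    }
    return plls <= EG_NUM_PLLS && dacs <= EG_NUM_DACS &&
           tx <= EG_NUM_TRANSMITTERS;
}
```

Under these assumptions, two dual-link DVI monitors plus a DP monitor fit, while two dual-link DVI monitors plus an HDMI monitor do not, because the third non-DP output has no PLL left.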