Understanding GPUs from the ground up

I get asked a lot about learning how to program GPUs.  Bringing up evergreen kms support seems like a good place to start, so I figured I’d write a series of articles detailing the process based on the actual evergreen patches.  First, to get a better understanding of how GPUs work, take a look at the radeon drm.  This article assumes a basic understanding of C and computer architecture.  The basic process is that the driver loads, initializes the hardware, sets up non-hw specific things like the memory manager, and sets up the displays.  This first article describes the basic driver flow when the drm loads in kms mode.

radeon_driver_load_kms() (in radeon_kms.c) is where everything starts.  It calls radeon_device_init() to initialize the non-display hardware and radeon_modeset_init() (in radeon_display.c) to initialize the display hardware.

The main workhorse of the driver initialization is radeon_device_init() found in radeon_device.c.  First we initialize a bunch of the structs used in the driver.  Then radeon_asic_init() is called.  This function sets up the asic specific function pointers for various things such as suspend/resume callbacks, asic reset, set/process irqs, set/get engine clocks, etc.  The common code then uses these callbacks to call the asic specific code to achieve the requested functionality.  For example, enabling and processing interrupts works differently on an RV100 vs. an RV770.  Since functionality changes in stages, some routines are used for multiple asic families.  This lets us mix and match the appropriate functions for the specifics of how the chip is programmed.  For example, R1xx and R3xx chips both use the same interrupt scheme (as defined in r100_irq_set()/r100_irq_process()), but they have different initialization routines (r100_init() vs. r300_init()).
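To make the callback idea concrete, here is a minimal userspace sketch.  It is not the driver's code: the struct asic_funcs, the demo_* functions and the two tables are made-up names for illustration, and each function just returns a tag so you can see which one ran.  The point is only that two families can share one callback while differing in another.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a per-asic callback table. */
struct asic_funcs {
    int (*init)(void);
    int (*irq_set)(void);
};

/* Each demo function returns a tag so the caller can tell which one ran. */
static int demo_r100_init(void)    { return 100; }
static int demo_r300_init(void)    { return 300; }
static int demo_r100_irq_set(void) { return 1; }

/* Two families: different init routines, same interrupt routine. */
const struct asic_funcs demo_r100_asic = { demo_r100_init, demo_r100_irq_set };
const struct asic_funcs demo_r300_asic = { demo_r300_init, demo_r100_irq_set };

/* Common code only ever goes through the table. */
int common_init(const struct asic_funcs *asic)
{
    assert(asic != NULL && asic->init != NULL);
    return asic->init();
}

int common_irq_set(const struct asic_funcs *asic)
{
    assert(asic != NULL && asic->irq_set != NULL);
    return asic->irq_set();
}
```

The common code never needs to know which chip it is talking to; picking the right table once at init time is enough.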

Next we set up the DMA masks for the driver.  These let the kernel know what size address space the card is able to address.  In the case of radeons, it’s used for GPU access to graphics buffers stored in system memory which are accessed via a GART (Graphics Address Remapping Table).  AGP and the older on-chip GART mechanisms are limited to 32 bits.  Newer on-chip GART mechanisms have larger address spaces.
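A DMA mask is just "all the address bits the device can drive."  The sketch below shows the arithmetic in plain C; it uses the same computation as the kernel's DMA_BIT_MASK() macro, but dma_mask_for_bits() and dma_addressable() are hypothetical helper names, and the reachability check is a simplification of what the kernel's DMA layer actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask with the low `bits` bits set; bits must be in 1..64. */
uint64_t dma_mask_for_bits(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
}

/* True if every byte of [addr, addr + size) is reachable under the mask. */
bool dma_addressable(uint64_t addr, uint64_t size, uint64_t mask)
{
    if (size == 0)
        return false;
    uint64_t last = addr + (size - 1);
    if (last < addr)            /* address arithmetic wrapped around */
        return false;
    return last <= mask;
}
```

With a 32 bit mask, a page that starts at 0xFFFFF000 is still reachable, but a buffer one byte longer is not; a 40 bit mask can reach pages above 4 GB.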

After DMA masks, we set up the MMIO aperture.  PCI/PCIE/AGP devices are programmed via apertures called BARs (Base Address Registers).  These apertures provide access to resources on the card such as registers, framebuffers, and ROMs.  GPUs are configured via registers; if you want to access those registers, you’d map the register BAR.  If you want to write to the framebuffer (some of which may be displayed on your screen), you would map the framebuffer BAR.  In this case we map the register BAR; this register mapping is then used by the driver to configure the card.
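Once the register BAR is mapped, programming the card boils down to 32 bit reads and writes at offsets into that mapping.  Here is a rough userspace sketch of that access pattern.  The struct reg_aperture and the reg_* helpers are hypothetical names, and the "aperture" is whatever memory you hand it (an ordinary array works for experimenting), not a real mapped BAR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a mapped register BAR: a base pointer plus its size in bytes. */
struct reg_aperture {
    volatile uint32_t *base;
    size_t size;
};

/* Read one 32 bit register at a byte offset into the aperture. */
uint32_t reg_read32(const struct reg_aperture *ap, size_t offset)
{
    assert(offset % 4 == 0 && offset + 4 <= ap->size);
    return ap->base[offset / 4];
}

/* Write one 32 bit register at a byte offset into the aperture. */
void reg_write32(struct reg_aperture *ap, size_t offset, uint32_t value)
{
    assert(offset % 4 == 0 && offset + 4 <= ap->size);
    ap->base[offset / 4] = value;
}

/* Read-modify-write: change only the bits selected by mask. */
void reg_update32(struct reg_aperture *ap, size_t offset,
                  uint32_t mask, uint32_t bits)
{
    uint32_t v = reg_read32(ap, offset);
    v = (v & ~mask) | (bits & mask);
    reg_write32(ap, offset, v);
}
```

The volatile qualifier matters on real hardware: it keeps the compiler from caching or dropping accesses that have side effects on the device.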

vga_client_register() comes next, and is beyond the scope of this article.  It’s basically a way to work around the limitations of VGA on PCI buses with multiple VGA devices.

Next up is radeon_init().  This is actually a macro defined in radeon.h that references the asic init callback we initialized in radeon_asic_init() several steps ago.  The asic specific init function is called.  For an RV100, it would be r100_init() defined in r100.c; for an RV770, it’s rv770_init().

That’s pretty much it for radeon_device_init().  Next let’s look at what happens in the asic specific init functions.  They all follow the same pattern, although some asics may do more or less depending on the functionality.  Let’s take a look at r100_init() in r100.c.  First we initialize debugfs; this is a kernel debugging framework and outside the scope of this article.  Next we call r100_vga_render_disable(), which disables the VGA engine on the card.  The VGA engine provides VGA compatibility; since we are going to be programming the card directly, we disable it.

Following that, we set up the GPU scratch registers (radeon_scratch_init() defined in radeon_device.c).  These are scratch registers used by the CP (Command Processor) to signal graphics events.  In general they are used for what we call fences.  A write to one of these scratch registers can be added to the command stream sent to the GPU.  When the GPU encounters that command, it writes the value specified to that scratch register.  The driver can then check the value of the scratch register to determine whether that fence has come up or not.  For example, if you want to know if the GPU is done rendering to a buffer, you’d insert a fence after the rendering commands.  You can then check the scratch register to determine if that fence has passed (and hence the rendering is done).
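Here is a toy model of the fence mechanism, just to show the idea.  Everything in it is hypothetical (struct toy_gpu, the CMD_* command types, the function names); a real command stream is packet based and much richer.  I'm also assuming fences are increasing sequence numbers, which is a common way to do it, so the check uses a signed difference to stay correct if the counter wraps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum cmd_type { CMD_DRAW, CMD_WRITE_SCRATCH };

struct cmd {
    enum cmd_type type;
    uint32_t value;     /* fence sequence number for CMD_WRITE_SCRATCH */
};

struct toy_gpu {
    uint32_t scratch;   /* the scratch register the fence value lands in */
    size_t next;        /* index of the next command to execute */
};

/* Execute up to `count` commands from the stream, strictly in order. */
void toy_gpu_run(struct toy_gpu *gpu, const struct cmd *stream,
                 size_t len, size_t count)
{
    assert(gpu != NULL && stream != NULL);
    while (count-- > 0 && gpu->next < len) {
        const struct cmd *c = &stream[gpu->next++];
        if (c->type == CMD_WRITE_SCRATCH)
            gpu->scratch = c->value;    /* the fence "comes up" here */
    }
}

/* Driver side: has the fence with sequence number `seq` passed yet? */
bool fence_signaled(const struct toy_gpu *gpu, uint32_t seq)
{
    /* Signed difference keeps the comparison valid across wraparound. */
    return (int32_t)(gpu->scratch - seq) >= 0;
}
```

Because commands execute in order, seeing fence N in the scratch register means everything queued before fence N has been processed.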

radeon_get_bios() loads the video bios from the PCI ROM BAR.  The video bios contains data and command tables.  The data tables define things like the number and type of connectors on the card and how those connectors are mapped to encoders, the GPIO registers and bitfields used for DDC and other i2c buses, LVDS panel information for laptops, display and engine PLL limits, etc.  The command tables are used for initializing the hardware (normally done by the system bios during post, but required for things like suspend/resume and initializing secondary cards), and on systems with ATOM bios the command tables are used for setting up the displays and changing things like engine and memory clocks.

Next, we initialize the bios scratch registers (radeon_combios_initialize_bios_scratch_regs() via radeon_combios_init()).  These registers are a way for the firmware on the system to communicate state to the graphics driver.  They contain things like connected outputs, whether the driver or the firmware will handle things like lid or mode change events, etc.

radeon_boot_test_post_card() checks to see whether the system bios has posted the card or not.  This is used to determine whether the card needs to be initialized by the driver using the bios command tables or if the system bios has already done it.

radeon_get_clock_info() gets the PLL (Phase Locked Loop, used to generate clocks) information from the bios tables.  This includes the display PLLs, engine and memory PLLs and the reference clock that the PLLs use to generate their final clocks.

radeon_pm_init() initializes the power management features of the chip.

Next the MC (Memory Controller) is initialized (r100_mc_init()).  The GPU has its own address space similar to the CPU.  Within that address space you map VRAM and GART.  The blocks on the chip (2D, 3D engines, display controllers, etc.) access these resources via the GPU’s address space.  VRAM is mapped at one offset and GART at another.  If you want to read from a texture located in GART memory, you’d point the texture base address at some offset in the GART aperture in the GPU’s address space.  If you want to display a buffer in VRAM on your monitor, you’d point one of your crtc base addresses to an address in the VRAM aperture in the GPU’s address space.  The MC init function determines how much VRAM is on the card and where to place VRAM and GART in the GPU’s address space.
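To picture the GPU address space, here is a small sketch that lays out a VRAM range and a GART range and computes addresses inside them.  The placement policy (VRAM first, GART at the next aligned address after it) is an assumption I picked for illustration, not necessarily what r100_mc_init() does, and all the names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct gpu_range  { uint64_t start; uint64_t size; };
struct gpu_layout { struct gpu_range vram; struct gpu_range gart; };

/* Round v up to a multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t v, uint64_t align)
{
    return (v + align - 1) & ~(align - 1);
}

/* Assumed policy: VRAM at vram_base, GART at the next aligned address. */
struct gpu_layout layout_place(uint64_t vram_base, uint64_t vram_size,
                               uint64_t gart_size, uint64_t gart_align)
{
    assert(gart_align != 0 && (gart_align & (gart_align - 1)) == 0);
    struct gpu_layout l;
    l.vram.start = vram_base;
    l.vram.size  = vram_size;
    l.gart.start = align_up(vram_base + vram_size, gart_align);
    l.gart.size  = gart_size;
    return l;
}

/* Does a GPU address fall inside the given range? */
bool range_contains(const struct gpu_range *r, uint64_t addr)
{
    return addr >= r->start && addr - r->start < r->size;
}

/* GPU address of a byte offset into a range, e.g. a texture in GART
 * or a scanout buffer in VRAM. */
uint64_t range_addr(const struct gpu_range *r, uint64_t offset)
{
    assert(offset < r->size);
    return r->start + offset;
}
```

So pointing a texture base at "offset 0x1000 in GART" really means programming range_addr(&layout.gart, 0x1000) into the texture base register.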

radeon_fence_driver_init() initializes the common code used for fences.  See above for more on fences.

radeon_irq_kms_init() initializes the common code used for irqs.

radeon_bo_init() initializes the memory manager.

r100_pci_gart_init() sets up the on-board GART mechanism and radeon_agp_init() initializes AGP GART.  This allows the GPU to access buffers in system memory.  Since system memory is paged, large allocations are not contiguous.  The GART provides a way to make many disparate pages look like one contiguous block by using address remapping.  With AGP, the northbridge provides the address remapping, and you just point the GPU’s AGP aperture at the one provided by the northbridge.  The on-board GART provides the same functionality for non-AGP systems (PCI or PCIE).
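The remapping itself is just a page table lookup.  The sketch below shows the translation from a contiguous GPU address to a scattered system page; struct toy_gart, the 4 KB page size and the flat array of page addresses are simplifying assumptions, and real GART entries carry extra flag bits in a hardware specific format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u

/* One table entry per aperture page: the system-memory address of the
 * page that backs it. */
struct toy_gart {
    uint64_t aperture_base;      /* start of the aperture in GPU space */
    const uint64_t *page_table;  /* system page address per GART page */
    size_t num_pages;
};

/* Translate a GPU address inside the aperture to a system address. */
uint64_t toy_gart_translate(const struct toy_gart *g, uint64_t gpu_addr)
{
    assert(gpu_addr >= g->aperture_base);
    uint64_t off = gpu_addr - g->aperture_base;
    size_t page = (size_t)(off / TOY_PAGE_SIZE);
    assert(page < g->num_pages);
    return g->page_table[page] + off % TOY_PAGE_SIZE;
}
```

From the GPU's point of view the buffer is one linear block starting at aperture_base, even though consecutive pages may live anywhere in system memory.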

Next up we have r100_set_safe_registers().  This function sets the list of registers that command buffers from userspace are allowed to access.  When a userspace driver like the ddx (2D) or mesa (3D) sends commands to the GPU, the drm checks those command buffers to prevent access to unauthorized registers or memory.
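A simple way to think about the checking is a bitmap of allowed registers that every register write in a command buffer is tested against.  The sketch below is only that idea; the names, the tiny 64 register space and the "one bit means allowed" encoding are all assumptions for the example, and the real checker also has to validate things like buffer addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_REGS 64u    /* tiny register space for the example */

struct reg_write { uint32_t reg; uint32_t value; };

/* One bit per register index: 1 means userspace may write it. */
struct safe_regs { uint32_t bitmap[TOY_NUM_REGS / 32]; };

void safe_regs_allow(struct safe_regs *s, uint32_t reg)
{
    assert(reg < TOY_NUM_REGS);
    s->bitmap[reg / 32] |= UINT32_C(1) << (reg % 32);
}

bool safe_regs_is_allowed(const struct safe_regs *s, uint32_t reg)
{
    if (reg >= TOY_NUM_REGS)
        return false;       /* out of range is never allowed */
    return (s->bitmap[reg / 32] >> (reg % 32)) & 1u;
}

/* Reject the whole buffer if any write targets a register not on the list. */
bool cmdbuf_check(const struct safe_regs *s,
                  const struct reg_write *cmds, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!safe_regs_is_allowed(s, cmds[i].reg))
            return false;
    return true;
}
```

Rejecting the entire buffer on the first bad write keeps the policy simple: nothing from an untrusted buffer reaches the hardware unless all of it is acceptable.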

Finally, r100_startup() programs the hardware with everything set up in r100_init().  It’s a separate function since it’s also called when resuming from suspend, as the current hardware configuration needs to be restored in that case as well.  The VRAM and GART setup is programmed in r100_mc_program() and r100_pci_gart_enable(); irqs are set up in r100_irq_set().

r100_cp_init() initializes the CP and sets up the ring buffer.  The CP is the part of the chip that feeds acceleration commands to the GPU.  It’s fed by a ring buffer that the driver (CPU) writes to and the GPU reads from.  Besides commands, you can also write pointers to command buffers stored elsewhere in the GPU’s address space (called indirect buffers).  For example, the 3D driver might send a command buffer to the drm; after checking it, the drm would put a pointer to that command buffer on the ring, followed by a fence.  When the CP gets to the pointer in the ring, it fetches the command buffer and processes the commands in it, then returns to where it left off in the ring.  Buffers referenced by the command buffer are “locked” until the fence passes since the GPU is accessing them in the execution of those commands.
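Here is a toy ring to show the producer/consumer relationship and how an indirect buffer followed by a fence flows through it.  The packet types, struct toy_ring and the function names are hypothetical, and "executing" an indirect buffer is reduced to counting its commands; real rings hold raw dwords and the CP parses packets out of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u    /* entries; power of two so indices wrap with a mask */

enum pkt_type { PKT_NOP, PKT_INDIRECT, PKT_FENCE };

struct pkt {
    enum pkt_type type;
    const uint32_t *ib; /* PKT_INDIRECT: commands stored outside the ring */
    size_t ib_len;
    uint32_t seq;       /* PKT_FENCE: value written when the CP gets here */
};

struct toy_ring {
    struct pkt entries[RING_SIZE];
    uint32_t wptr;      /* advanced by the driver (CPU) */
    uint32_t rptr;      /* advanced by the CP (GPU) */
    uint32_t fence;     /* last fence value the CP wrote */
    uint64_t executed;  /* count of indirect-buffer commands processed */
};

/* Driver side: queue one packet; fails if the ring is full. */
bool ring_emit(struct toy_ring *r, struct pkt p)
{
    if (r->wptr - r->rptr == RING_SIZE)
        return false;
    r->entries[r->wptr++ & (RING_SIZE - 1)] = p;
    return true;
}

/* CP side: consume one packet; returns false if the ring is empty. */
bool ring_consume(struct toy_ring *r)
{
    if (r->rptr == r->wptr)
        return false;
    const struct pkt *p = &r->entries[r->rptr++ & (RING_SIZE - 1)];
    if (p->type == PKT_INDIRECT) {
        assert(p->ib != NULL);
        r->executed += p->ib_len;   /* stand-in for running the IB */
    } else if (p->type == PKT_FENCE) {
        r->fence = p->seq;
    }
    return true;
}
```

Since the fence packet sits right behind the indirect buffer pointer, the fence value only shows up after the CP has finished with that buffer, which is what makes it safe to release the buffers it referenced.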

r100_wb_init() initializes scratch register writeback which is a feature that lets the GPU update copies of the scratch registers in GART memory.  This allows the driver (running on the CPU) to access the content of those registers without having to read them from the MMIO register aperture which requires a trip across the bus.

r100_ib_init() initializes the indirect buffers used for feeding command buffers to the CP from userspace drivers like the 3D driver.

The display side is set up in radeon_modeset_init().  First we set up the display limits and mode callbacks, then we set up the output properties (radeon_modeset_create_props()) that are exposed via xrandr properties when X is running.

Next, we initialize the crtcs in radeon_crtc_init().  crtcs (also called display controllers) are the blocks on the chip that provide the display timing and determine where in the framebuffer a particular monitor points to.  A crtc provides an independent “head.”  Most radeon asics have two crtcs; the new evergreen chips have six.

radeon_setup_enc_conn() sets up the connector and encoder mappings based on video bios data tables.  Encoders are things like DACs for analog outputs like VGA and TV, and TMDS or LVDS encoders for things like digital DVI or LVDS panels.  An encoder can be tied to one or more connectors (e.g., the TV DAC is often tied to both the S-video and a VGA port or the analog portion of a DVI-I port).  The mapping is important as you need to know what encoders are in use and what they are tied to in order to program the displays properly.

radeon_hpd_init() is a macro that points to the asic specific function that initializes the HPD (Hot Plug Detect) hardware for digital monitors.  HPD allows you to get an interrupt when a digital monitor is connected or disconnected.  When this happens the driver will take appropriate action and generate an event which userspace apps can listen for.  The app can then display a message asking the user what they want to do, etc.

Finally, radeon_fbdev_init() sets up the drm kernel fb interface.  This provides a kernel fb interface on top of the drm for the console or other kernel fb apps.

When the driver is unloaded the whole process happens in reverse; this time all the *_fini() functions are called to tear down the driver.
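The init/fini pairing can be sketched as a table of stages that is walked forward on load and backward on unload.  This is a generic pattern written for illustration, not how the radeon code is actually structured; struct stage, the demo stages and the trace buffer are all made up.  It also shows the related case of an init failing partway through, where only the stages that already came up get torn down.

```c
#include <assert.h>
#include <stddef.h>

struct stage {
    int  (*init)(void);     /* returns 0 on success */
    void (*fini)(void);
};

/* Bring stages up in order; on failure, tear down the ones already up,
 * newest first, and return -1. */
int stages_init(const struct stage *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (s[i].init() != 0) {
            while (i-- > 0)
                s[i].fini();
            return -1;
        }
    }
    return 0;
}

/* Tear everything down in the reverse of init order. */
void stages_fini(const struct stage *s, size_t n)
{
    while (n-- > 0)
        s[n].fini();
}

/* Demo stages that record the order in which they run. */
char trace[16];
size_t trace_len;

static void note(char c)
{
    assert(trace_len < sizeof(trace) - 1);
    trace[trace_len++] = c;
    trace[trace_len] = '\0';
}

static int  a_init(void) { note('A'); return 0; }
static void a_fini(void) { note('a'); }
static int  b_init(void) { note('B'); return 0; }
static void b_fini(void) { note('b'); }
static int  c_init(void) { note('C'); return 0; }
static void c_fini(void) { note('c'); }
static int  x_init(void) { note('X'); return -1; }  /* always fails */
static void x_fini(void) { note('x'); }

const struct stage demo_stages[]         = { { a_init, a_fini }, { b_init, b_fini }, { c_init, c_fini } };
const struct stage demo_failing_stages[] = { { a_init, a_fini }, { b_init, b_fini }, { x_init, x_fini } };
```

Tearing down in reverse matters because later stages usually depend on earlier ones (the ring needs the GART, the GART needs the memory manager, and so on).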

The next set of articles will walk through the evergreen patches available here, which have already been applied upstream, and explain what each patch does to bring up support for evergreen chips.
