BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20190719T085744Z
LOCATION:HG F 3
DTSTART;TZID=Europe/Stockholm:20190613T174500
DTEND;TZID=Europe/Stockholm:20190613T181500
UID:submissions.pasc-conference.org_PASC19_sess161_msa308@linklings.com
SUMMARY:Machine Learning Near-Optimal Parameters for Small Matrix-Matrix M
 ultiplication Kernels on GPUs
DESCRIPTION:Minisymposium\nComputer Science and Applied Mathematics\n\nMac
 hine Learning Near-Optimal Parameters for Small Matrix-Matrix Multiplicati
 on Kernels on GPUs\n\nJakobovits\n\nParameterized kernels are an important
  tool to achieve high performance in scientific computing. Auto-tuning all
  kernels with an exhaustive search in parameter space is often prohibitive
 ly expensive in time and compute resources. Here, we use machine learning 
 to derive a performance model from a subset of tuning data that accurately
  predicts performance over the complete kernel set. This makes determining
  near-optimal parameters for the entire space cost-effective. In this appl
 ication, small matrix-matrix multiplications are parameterized, for exampl
 e, by the number of thread-blocks, tiling sizes, and matrix read/write str
 ategies, yielding tens of thousands of parameter sets per kernel. The
  optimal parame
 ters depend sensitively on the three matrix dimensions that define the pro
 duct. The model predicts performance based on matrix dimensions and parame
 ters, leveraging hardware knowledge, such as register count and shared
  memory size. Near-optimal parameter sets are determined for 90'000
  kernels. On a
 verage, the resulting performance is within 3% of the true optimum, and co
 nsistently outperforms an expert-crafted baseline result by 20%. The resul
 t of this work is integrated in DBCSR, a sparse matrix-matrix multiplicati
 on library used in state-of-the-art HPC applications such as CP2K, where i
 n combination with just-in-time compilation, user applications experience 
 a 3-fold speedup. In ongoing work, we explore transfer learning across dif
 ferent GPU architectures.
END:VEVENT
END:VCALENDAR

