Thomas Gleixner 54ff7e595d x86: hpet: Work around hardware stupidity
This more or less reverts commits 08be979 (x86: Force HPET
readback_cmp for all ATI chipsets) and 30a564be (x86, hpet: Restrict
read back to affected ATI chipsets) to the status of commit 8da854c
(x86, hpet: Erratum workaround for read after write of HPET
comparator).

The delta to commit 8da854c is mostly comments and the change from
WARN_ONCE to printk_once as we know the call path of this function
already.

This needs really in depth explanation:

First of all the HPET design is a complete failure. Having a counter
compare register which generates an interrupt on matching values
forces the software to do at least one superfluous readback of the
counter register.

While it is nice in theory to program "absolute" time events it is
practically useless because the timer runs at some absurd frequency
which can never be matched to real world units. So we are forced to
calculate a relative delta and this forces a readout of the actual
counter value, adding the delta and programming the compare
register. When the delta is small enough we run into the danger that
we program a compare value which is already in the past. Due to the
compare for equal nature of HPET we need to read back the counter
value after writing the compare rehgister (btw. this is necessary for
absolute timeouts as well) to make sure that we did not miss the timer
event. We try to work around that by setting the minimum delta to a
value which is larger than the theoretical time which elapses between
the counter readout and the compare register write, but that's only
true in theory. A NMI or SMI which hits between the readout and the
write can easily push us beyond that limit. This would result in
waiting for the next HPET timer interrupt until the 32bit wraparound
of the counter happens which takes about 306 seconds.

So we designed the next event function to look like:

   match = read_cnt() + delta;
   write_compare_ref(match);
   return read_cnt() < match ? 0 : -ETIME;

At some point we got into trouble with certain ATI chipsets. Even the
above "safe" procedure failed. The reason was that the write to the
compare register was delayed probably for performance reasons. The
theory was that they wanted to avoid the synchronization of the write
with the HPET clock, which is understandable. So the write does not
hit the compare register directly instead it goes to some intermediate
register which is copied to the real compare register in sync with the
HPET clock. That opens another window for hitting the dreaded "wait
for a wraparound" problem.

To work around that "optimization" we added a read back of the compare
register which either enforced the update of the just written value or
just delayed the readout of the counter enough to avoid the issue. We
unfortunately never got any affirmative info from ATI/AMD about this.

One thing is sure, that we nuked the performance "optimization" that
way completely and I'm pretty sure that the result is worse than
before some HW folks came up with those.

Just for paranoia reasons I added a check whether the read back
compare register value was the same as the value we wrote right
before. That paranoia check triggered a couple of years after it was
added on an Intel ICH9 chipset. Venki added a workaround (commit
8da854c) which was reading the compare register twice when the first
check failed. We considered this to be a penalty in general and
restricted the readback (thus the wasted CPU cycles) to the known to
be affected ATI chipsets.

This turned out to be a utterly wrong decision. 2.6.35 testers
experienced massive problems and finally one of them bisected it down
to commit 30a564be which spured some further investigation.

Finally we got confirmation that the write to the compare register can
be delayed by up to two HPET clock cycles which explains the problems
nicely. All we can do about this is to go back to Venki's initial
workaround in a slightly modified version.

Just for the record I need to say, that all of this could have been
avoided if hardware designers and of course the HPET committee would
have thought about the consequences for a split second. It's out of my
comprehension why designing a working timer is so hard. There are two
ways to achieve it:

 1) Use a counter wrap around aware compare_reg <= counter_reg
    implementation instead of the easy compare_reg == counter_reg

    Downsides:

	- It needs more silicon.

	- It needs a readout of the counter to apply a relative
	  timeout. This is necessary as the counter does not run in
	  any useful (and adjustable) frequency and there is no
	  guarantee that the counter which is used for timer events is
	  the same which is used for reading the actual time (and
	  therefor for calculating the delta)

    Upsides:

	- None

  2) Use a simple down counter for relative timer events

    Downsides:

	- Absolute timeouts are not possible, which is not a problem
	  at all in the context of an OS and the expected
	  max. latencies/jitter (also see Downsides of #1)

   Upsides:

	- It needs less or equal silicon.

	- It works ALWAYS

	- It is way faster than a compare register based solution (One
	  write versus one write plus at least one and up to four
	  reads)

I would not be so grumpy about all of this, if I would not have been
ignored for many years when pointing out these flaws to various
hardware folks. I really hate timers (at least those which seem to be
designed by janitors).

Though finally we got a reasonable explanation plus a solution and I
want to thank all the folks involved in chasing it down and providing
valuable input to this.

Bisected-by: Nix <nix@esperi.org.uk>
Reported-by: Artur Skawina <art.08.09@gmail.com>
Reported-by: Damien Wyart <damien.wyart@free.fr>
Reported-by: John Drescher <drescherjm@gmail.com>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: stable@kernel.org
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-09-15 00:55:13 +02:00

117 lines
3.3 KiB
C

#ifndef _ASM_X86_HPET_H
#define _ASM_X86_HPET_H
#include <linux/msi.h>
#ifdef CONFIG_HPET_TIMER
#define HPET_MMAP_SIZE 1024
#define HPET_ID 0x000
#define HPET_PERIOD 0x004
#define HPET_CFG 0x010
#define HPET_STATUS 0x020
#define HPET_COUNTER 0x0f0
#define HPET_Tn_CFG(n) (0x100 + 0x20 * n)
#define HPET_Tn_CMP(n) (0x108 + 0x20 * n)
#define HPET_Tn_ROUTE(n) (0x110 + 0x20 * n)
#define HPET_T0_CFG 0x100
#define HPET_T0_CMP 0x108
#define HPET_T0_ROUTE 0x110
#define HPET_T1_CFG 0x120
#define HPET_T1_CMP 0x128
#define HPET_T1_ROUTE 0x130
#define HPET_T2_CFG 0x140
#define HPET_T2_CMP 0x148
#define HPET_T2_ROUTE 0x150
#define HPET_ID_REV 0x000000ff
#define HPET_ID_NUMBER 0x00001f00
#define HPET_ID_64BIT 0x00002000
#define HPET_ID_LEGSUP 0x00008000
#define HPET_ID_VENDOR 0xffff0000
#define HPET_ID_NUMBER_SHIFT 8
#define HPET_ID_VENDOR_SHIFT 16
#define HPET_ID_VENDOR_8086 0x8086
#define HPET_CFG_ENABLE 0x001
#define HPET_CFG_LEGACY 0x002
#define HPET_LEGACY_8254 2
#define HPET_LEGACY_RTC 8
#define HPET_TN_LEVEL 0x0002
#define HPET_TN_ENABLE 0x0004
#define HPET_TN_PERIODIC 0x0008
#define HPET_TN_PERIODIC_CAP 0x0010
#define HPET_TN_64BIT_CAP 0x0020
#define HPET_TN_SETVAL 0x0040
#define HPET_TN_32BIT 0x0100
#define HPET_TN_ROUTE 0x3e00
#define HPET_TN_FSB 0x4000
#define HPET_TN_FSB_CAP 0x8000
#define HPET_TN_ROUTE_SHIFT 9
/* Max HPET Period is 10^8 femto sec as in HPET spec */
#define HPET_MAX_PERIOD 100000000UL
/*
* Min HPET period is 10^5 femto sec just for safety. If it is less than this,
* then 32 bit HPET counter wrapsaround in less than 0.5 sec.
*/
#define HPET_MIN_PERIOD 100000UL
/* hpet memory map physical address */
extern unsigned long hpet_address;
extern unsigned long force_hpet_address;
extern u8 hpet_blockid;
extern int hpet_force_user;
extern u8 hpet_msi_disable;
extern int is_hpet_enabled(void);
extern int hpet_enable(void);
extern void hpet_disable(void);
extern unsigned int hpet_readl(unsigned int a);
extern void force_hpet_resume(void);
extern void hpet_msi_unmask(unsigned int irq);
extern void hpet_msi_mask(unsigned int irq);
extern void hpet_msi_write(unsigned int irq, struct msi_msg *msg);
extern void hpet_msi_read(unsigned int irq, struct msi_msg *msg);
#ifdef CONFIG_PCI_MSI
extern int arch_setup_hpet_msi(unsigned int irq, unsigned int id);
#else
static inline int arch_setup_hpet_msi(unsigned int irq, unsigned int id)
{
return -EINVAL;
}
#endif
#ifdef CONFIG_HPET_EMULATE_RTC
#include <linux/interrupt.h>
typedef irqreturn_t (*rtc_irq_handler)(int interrupt, void *cookie);
extern int hpet_mask_rtc_irq_bit(unsigned long bit_mask);
extern int hpet_set_rtc_irq_bit(unsigned long bit_mask);
extern int hpet_set_alarm_time(unsigned char hrs, unsigned char min,
unsigned char sec);
extern int hpet_set_periodic_freq(unsigned long freq);
extern int hpet_rtc_dropped_irq(void);
extern int hpet_rtc_timer_init(void);
extern irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id);
extern int hpet_register_irq_handler(rtc_irq_handler handler);
extern void hpet_unregister_irq_handler(rtc_irq_handler handler);
#endif /* CONFIG_HPET_EMULATE_RTC */
#else /* CONFIG_HPET_TIMER */
static inline int hpet_enable(void) { return 0; }
static inline int is_hpet_enabled(void) { return 0; }
#define hpet_readl(a) 0
#endif
#endif /* _ASM_X86_HPET_H */